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What this thesis is about 

This thesis presents a new methodology called Multi Layer Analysis (MLA) that acts 
a transformation from the space of one-dimensional signals to a new space called space 
of intervals. The main idea of this approach, shared by several other ones, is the 
decomposition of the input signal into basic features that allows to better extract its 
useful information. 

The main motivation of this study was to develop a new high scalable methodology 
in order to extract shape information from one-dimensional signals. This because a 
lot of real problems fall in this context. In fact, several application domains such as 
Geology, Biomedicine and Biology require the analysis of one-dimensional signals in 
which their features are encoded in the shapes of whole signals or on the shapes of 
their sub-fragments (e.g seismic signals, ECG tracks or chip-chip or chip-seq tracks). 
The kind of analysis obviously depends on the application domains but usually involves 
Pattern Discovery, Clustering or Classification methodologies. The main advantages of 
the MLA compared to other similar methods, are its scalability and the possibility to 
represent a one-dimensional signal in terms of a tree of intervals, and this permits to 
express or characterize explicitly any kind of shape. Consequently, this has strong im- 
plications since it establishes a connection between the class of algorithms that process 
one dimensional signals, such as digital signal processing techniques, and algorithms on 
trees and graphs. 

Contributions and Thesis Outline 

The MLA methodology can be used as preprocessing step in different fields of applica- 
tion e.g.: Classification, Clustering, Pattern Discovery and Test of Randomness. Thus, 
it can be used as tool in the field of data analysis. More in details: 

• This method has been applied to the biological problem of nucleosome positioning 
providing similar performances to the state of the art method, but better scala- 
bility and computational time. This is a fundamental point because it allows to 
analyze more complex organisms. It is also able to recover the positions of fuzzy 
nucleosomes. 

• A new nonparametric test of randomness based on MLA, that exploits shape 
features that are rare in a random signal, was developed. 
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It allows to map a one-dimensional signal in a tree of intervals. Consequently 
some tree kernels, used in different contexts, have been adapted to this represen- 
tation, providing new kernels that explicitly encode the shape information of a 
one-dimensional signal expressed as a tree of intervals. 

The mapping of a one-dimensional signal in a tree of intervals creates a new 
and important connection between two fundamental classes of algorithms: signal 
processing algorithms and algorithms on trees and graphs. 



Chapter 1 presents the motivations of ML A, focusing on different methodologies 
that exploit and share the same idea. Some approaches, at first sight disjointed, 
but actually exploiting the same idea of multi-resolution or multi- views analysis, 
are presented. Some aspects of these methods are related to the MLA analysis; in 
particular similarities or advantages of one method with respect to the others are 
highlighted. In addition, all the basic definitions of the problems where the MLA can 
be productively applied are briefly given. 

Chapter 2 provides a detailed and formal description of the MLA, explaining step 
by step the MLA transformation and highlighting its limits and properties. Finally, 
some general guidelines on how to use the MLA as a preprocessing step for several 
problems are provided. 

Chapter 3 explains how MLA can be integrated in the context of Pattern Discovery 
and Classification. In addition, a case study that regards a particular biological 
problem in which the MLA was successfully used is introduced: the nucleosome 
spacing. Moreover, an alternative approach for the same problem based on Hidden 
Markov Model and a comparison of the two methods are presented. Finally, the last 
section is devoted to the description of a new one-class classifier that was used as new 
classifier module of the MLA. 

Chapter 4 presents a new nonparametric test of randomness applicable to a set of 
one-dimensional signals that takes advantage of MLA preprocessing step. In particular, 
this procedure is based on the probability density function of the symmetrized 
Kullback-Leibler distance, estimated via a Monte Carlo simulation on the intervals 
lengths obtained by MLA. The main advantage of this new approach is to perform 
an exploratory analysis in order to directly verify the presence of several kinds of 
structures in an input signal. In particular, this test differs from the other approaches 
since it exploits shape features that are rare in a random signal. 
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Chapter 5 presents how the MLA can help on designing new kernel functions that 
explicitly take into account the shape information contained in a one-dimensional signal. 
The main idea of Kernel Methods is presented, giving more details on a particular 
subclass of kernel functions applicable to structured data, in particular trees. The 
MLA is used to define a mapping from the set of one-dimensional signal to the set of 
trees. Two new kernels that use the MLA representation are finally defined and a case 
study that regards sismographic signals is presented. 



Chapter 1 

Multi-resolution or multi-scale 

methodologies 



The proposed methodology is essentially a multi-level decomposition of a one- 
dimensional signal. The key point of this method is the multi-level analysis. The 
idea of "multi-level" or "multi-resolution" is shared by several apparently disjointed 
methodologies. 



1.1 Motivation of Multi Layer Analysis 

Recently the multi-scale or multi-resolution models have been research topics in rapid 
evolution, with great impact on Computer Science, Applied Mathematic, Image Anal- 
ysis and Signal Processing. The key idea of the MLA is to obtain several "views" or 
"features" of the same input data (at different scale, resolution or in a different domain) 
in order to perform a better and maybe more understandable analysis. Using this ap- 
proach it is possible to focus on the regions of interest with a finer resolution, having 
as a consequence an increase on the precision. The regions of interest can be detected 
by views or features at lower resolution; in this way it is possible to both obtain better 
results and an improvement in computational time. The idea of multi-scale analysis 
comes from the fact that many real systems have different behaviors at different scales. 
For example in physics there are different laws to describe a phenomenon at different 
scale or resolution, e.g. classical mechanics for describing the motion of macroscopic ob- 
jects in opposition to quantum mechanics that describes atoms and molecules. It is not 
an exaggeration to say that many real problems can be handled using different scales 
or resolutions. For example the human being organizes his time using seconds, hours, 
days, weeks, months, years reflecting the multi-scale dynamic of the solar system, using 
scale depending on the problem he is handling. The folding of a protein can require 
a time in the scale of seconds, while the scale of vibration of covalent bonds is in the 
order of 10 -15 seconds. In general, the more details of a system we want to model, the 
more complex the required laws to describe it becomes. 
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1.1.1 Multi-resolution or multi-scale methodologies 

In the following sections some approaches will be presented, at first sight disjointed, but 
actually exploiting the same idea of multi-resolution or multi-views analysis. In fact, 
the shared motivation of all these approaches is that in some cases it is easier, given an 
input signal, to extract and analyze a set of features or views that represents different 
information contained in it that analyzes the original signal. This is done by each 
methodology in different ways but the main idea that connects them is to decompose a 
signals into simpler parts (in frequency, time domain or in another scale or resolution) 
and perform the analysis combining the results information on each part. The MLA as 
well as the other methods exploits the same idea, in which the analysis is performed on 
several "parts" of the original signal obtained, as it will be explained in the next chapter, 
by a simple operation called threshold. Some aspects of these methods will be related 
to the MLA analysis in particular where there are strongly similarities or advantages of 
one method respect to the others. 

1.1.2 Discrete Fourier Transform 

One of the well-known methods that firstly exploited this idea is the Fourier Transform 
and in particular its variant for discrete signals called Discrete Fourier Transform 
(DFT). This transformation is mainly adopted when the information of interest are 
encoded in the frequency domain of a signal. In fact, the Fourier Transform and its 
discrete version i.e. DFT is an operation able to transform a discrete signal from the 
time domain into the frequency domain. This is done by decomposing it as a linear 
combination of sinusoidal components. Here the parts of the original signal are the 
pure sinusoids at different frequencies and phases. In more details the DFT decomposes 
a signal into a discrete spectra composed by its frequency components, while the 
inverse transform synthesizes the original signal from the frequency components into 
its spectra[78]. More formally: 



Definition DFT 

Given a discrete signal x(n) of N samples its DFT, and its inverse DFT are defined 

by these equations: 



Synthesis equation: 

2V-1 

x(n) = 2^c k e n (1.1) 

fc=0 
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Analysis equation: 



1 AT-l 

1 v^ — A j2irkn 

n=0 



In more details, the DFT allows to extract frequency, phase and amplitude informa- 
tion of the sinusoids coming from the decomposition of a signal. In addition, with the 
DFT, it is possible to find the frequency response of a system from its impulse response 
and viceversa. In this way it is possible to analyze a system in the frequency domain 
as it is possible to use the convolution to analyze a signal in the time domain. This 
approach in some sense extracts several views of the same input signal correspondent 
to the frequency components that it contains. However one of the main limitation of 
this approach is that it not perform well for non-stationary signals, and in addition it 
cannot characterize directly the shapes contained in a signal as it is possible instead to 
do with the ML A analysis. 

1.1.3 Wavelet Analysis 

A method that overcome some limitations of the Fourier Analysis is the Wavelet Analy- 
sis. A wavelet is a mathematical function and it is used to decompose a signal in compo- 
nents with different frequencies, resolutions and positions[l]. The position component 
is particulary useful when the input signal is not stationary i.e. it has been gener- 
ated by a stochastic process whose joint probability distribution does not change when 
shifted in time or space. For this reason wavelets are become popular and nowadays are 
widely used in multi-resolution analysis. The wavelets transform is the representation 
of a signal in term of scaled and translated copies of the same function called mother 
wavelet. More in detail, the wavelet transform is obtained by the convolution between 
a signal and a wavelet function, as illustrated in figure 1.1. It is possible to see in figure 
1.2 an example of scaling and translating a mother wavelet. A mother wavelet needs 
to satisfy some properties such as finite length and zero mean value. These properties 
make wavelet analysis more powerful than Fourier analysis since a signal can be decom- 
posed as a sum of the same wavelet properly translated and scaled, instead of using 
smooth and continuous function like sinusoids. This leads to a good decomposition also 
in the case of signal that shows discontinuities or in the case of non stationary processes. 
Figures 1.3,1.4,1.5 show some possible mother wavelets. 

Now it will be formally introduced the Continuous Wavelet Transform and the 
Inverse Continuous Wavelet Transform. 
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local matching of 
wavelet and signal leads 
to a large transform value 



Figure 1.1: Convolution of a signal with a wavelet function. (Part of) this figure is 
taken from [1] 

Definition Continuous Wavelet Transform 

The continuous wavelet transform or CWT of a continuous signal x(t), considering the 

mother wavelet ijj(a,b) is defined as: 



T(a,b) = w(a) I x{t)ij)* [ - ) 



(1.3) 



where ip* is the complex conjugate of the function ip, w(a) is a weighting function 
usually equal to -4= or -, a control the location of ip and b its scale. 

Definition Inverse Continuous Wavelet Transform, 

The continuous inverse wavelet transform or ICWT of the wavelet transform T(a, b) of 

continuous signal x(t) with respect to the mother wavelet ijj(a,b) is defined as: 



oo oo 

x ( t ) = — I / T(a,b)il> a>b (t)— r 



C„ 



(1.4) 



-oo 



where a control the location of ip used and b its scale. 



1.1.4 Scale Space Theory 

Another methodology that exploits the idea of decomposition of a signal in simpler 
"parts" is the Scale Space Theory that is a framework for a multi-scale representation of 
signals developed in the fields of computer vision, image processing and signal processing 
[50]. It is a formal theory applied to manipulate signals of one or more dimensions at 
different scales. Here the "parts" of a signal are structures or features at different scales 
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Figure 1.2: Scaling and translation of a mother wavelet. (Part of) this figure is taken 
from [1] 

contained in it and as in the wavelet approach the parts are obtained by a convolution 
of a base signal at different scales. The main difference is how the convolution is 
performed and how the information of the parts are combined. The concept of scale 
space is general and it can be used in an arbitrary number of dimensions. For simplicity, 
here the most used framework, that is the case of linear scale space in two dimensions, 
will be described. 



Definition Linear Scale Space 

Given a two-dimensional signal f(x,y) (e.g. an image), its linear scale space is a family 

of derived signals L(x, y, t) defined by the convolution of signal f(x, y) with a Gaussian 
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Figure 1.3: Haar wavelet. 




Figure 1.4: Mexican hat wavelet. 



kernel g: 



q(x,v,t) = e 



(1.5) 



such that: 



L(x,y,t) = g(x,y,t) * f(x,y) 



Where t = a 1 is the variance of the Gaussian. 



(1.6) 



The reason for generating a scale space representation of an image, for example, derives 
from the consideration that real world objects consist of different structures at different 
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Figure 1.5: Morlet wavelet. 



scales. This implies that the real- world objects are different from those of the idealized 
mathematical entities, such as points or lines, and may appear differently depending 
on the scale we use to observe them. For example, the concept of tree is appropriate 
if we think in the scale of meters, while the concept of leaf requires a finer scale. For 
example, a machine vision system that has to analyze an unknown scene, cannot know 
in advance which scales are appropriate to describe the data in the scene. For this 
reason, a reasonable approach is to consider descriptions of the scene at different scales 
simultaneously. An example of this approach is illustrated in figure 1.6. 



1.1.5 Quadtree Analysis 



Quadtree Analysis is another image analysis technique that consists in iteratively split- 
ting an image into blocks that are more homogeneous than the image itself by using a 
particular data structure called quadtree [20]. This technique, examining the image at 
different resolutions, allows to obtain information about its structure. It is also used 
as the first step in adaptive algorithms for image compression. The technique consists 
in dividing a square image into four blocks of equal size, and then test whether each 
block meets some homogeneity criterion (for example, if the gray levels of all pixels 
belonging into a block have a specific range of values). If the block meets the criterion 
it will not be further splitted, otherwise it will be again divided into four blocks that 
will be tested again according to some homogeneity criterion. This process is iterated 
until each block meets the criteria. The entire process obviously will split the image 
into blocks of different sizes. An example of quadtree analysis, used to detect salient 
objects in an image, is shown in figure 1.7. 
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1.1.6 String methods 

In a lot of discipline the input data comes in natural form as string: bio-sequences, 
graphs and text documents. In this scenario there are several methodologies that exploit 
the "multi views" approach in terms of subsequences or substrings of the input string. 
For example there are several similarity measures between string objects in which the 
more similar, the greater it is the number of the factors they share [53]. Another example 
that will be presented in detail in chapter 5 is the family of convolution kernels[34]. The 



(a) L(x, y, t) at scale t — (b) L(x, y, t) at scale t — 1 (c) L(x, y, t) at scale t = 4 

(original image) 



(d) L(x,y,t) at scale t — 16 (e) L(x,y,t) at scale t = 64 (f) L(x,y,t) at scale t = 256 
Figure 1.6: Scale Space representation 
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Figure 1.7: Quadtree image segmentation 
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basic idea of a convolution kernel is to decompose a data object into simpler parts and 
then define a kernel function in terms of such parts. A very common kernel for string 
classification (especially protein sequences) that exploits this idea is the spectrum kernel. 
The main idea behind it, is that the more substrings with a fixed length are shared by 
two string, the more similar they are (see [49] for details). More formally let's consider 
the following definition: 

Definition Spectrum Kernel 

Let £ be a finite alphabet, S* denote all possible string over £ and S fc all the string 
over X of length k. Let #x[w] denote the number of occurrences of w in x e.g. #x[w] = 
\{y\x = y ■ w ■ z Ay,z £ £*}| and Gjjx] the k-gram vector of x over all the string in E 
e.g. G/cfx] = (#x[w]) w£ -£k. Given a k 6 N the spectrum kernel can be defined as: 

5 fc (si,a 2 ) = Yl # S1 H • # S2 N = ( G k[*i],Gk[*2]) (1-7) 

1.1.7 Level Set 

Another approach that decompose a signal in parts and that is very close to the MLA 
is the Level Set method that is a numerical technique for the recognition of shapes in a 
signal [74]. This methods is based on the fact that usually, it is easier to characterize 
a shape using a particular set of auxiliary functions called Level Sets than using the 
shape directly. In fact the level sets allow to characterize a shape considering several of 
its levels or subviews. In figure 1.8 it is possible to see a pictorial representation of this 
approach on a function of 2 variables. Now it will be provided the formal definition of 
Level Set: 

Definition Level Set of a function 

Starting from a function / : M. n — >la level set is a set of the form: 

{(xi, . . . ,x n )\f(xi, . . . ,x n ) = k} (1.8) 

If n = 2 this set is called level curve, if n = 3 the set is called level surface or more 
in general if n > 3 it is called level hyper surf ace. In particular using a level set it is 
possible to express a closed curve T indirectly using the the function / and considering 
the level set: V = {(xi,... ,x n )\f(xi,. . . ,x n ) = 0} 

As it will be possible to see later, the MLA idea in some sense is very close to 
this approach since the information that characterize the signal are similar. The main 
difference is the way the information are organized, in fact with MLA it is possible to 
characterize any shape in a natural and elegant way using a particular structure to store 
these information. 
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Figure 1.8: Level Set representation for a function depending on 2 variables. 

1.2 Pattern Discovery and Classification 

The next section presents two machine learning techniques in which the MLA can be 
promiscuously integrated. For this reason here will be introduced the general problems 
of Pattern Discovery and Classification, while in chapter 3 will be cover in detail how 
to integrate the MLA in these contexts. 



1.2.1 Pattern Discovery 

Pattern discovery is a general discipline in which the main goal is to process a large 
amounts of data in order to efficiently extract unknown useful knowledge [87] . In other 
words a pattern discovery method discovers subsets of input data that are meaningful 
accordingly to a formal criteria. More in general, the pattern discovery is a research 
area that provides efficient methods to uncover, without using "a priory" knowledge on 
the data, patterns that are repetitive, unexpected or interesting, using a formal criteria. 
In order to better understand pattern discovery, it is first necessary to define the 
meaning of pattern. Informally a pattern is any relation in the data that is of our 
interest and that is not casual or random. In other words it is necessary to answer to 
the question: how meaningful is a pattern? This is because the human mind has the 
tendency to see patterns everywhere. For this reason, it is necessary to understand if 
a pattern is significative in a rigorous way. More formally, a pattern is a data vector 
serving to describe an anomalously high local density of data points [32]. This means 
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that particular points have a different behavior than the points in other regions usually 
called "background" and that are not interesting since in those regions they have a 
behavior not related to the true process that has generated the "anomalies". 

During the last years a lot of attention was paid to this problem so that it is 
possible to find several tools in the realm of Statistic and in the Computer Science 
to address this problem. In particular these techniques can be fruitfully applied to 
several unconnected application domains such as: speech recognition, biology, finance 
and econometric, biomedicine, text analysis, statistics. As a matter of fact the data 
involved in the pattern discovery methods are of different kinds such as sequence, image, 
sound and structured data such as tree and graphs [87, 15, 6, 61, 13, 83]. 

1.2.2 General schema of a Pattern Discovery method 

A general pattern discovery method can be subdivided in three main parts [83] as it is 
possible to see in figure 1.9. 



Language of 
patterns 



Pattern 
Discovery 



Algorithm 



Score Function 



Figure 1.9: Pattern Discovery parts 

A language to describe the pattern; 

a score function to assesses the interestingness of a pattern; 

an efficient algorithm that identifies the most interesting patterns using the score 
function. 



Obviously, these three parts depend strongly on the particular application domains 
taken into consideration. In particular, this is true for the language used to describe 
the patterns, in fact the data are not always in the form of feature vectors or in term 
of some formal languages (or grammars). In this sense languages can be thought as 
a transformation that encodes the information present in the data in a suitable form 
for a particular score function. Another important point is the choice of the most 
suitable score function for the particular process that has generated the data, in order 
to discover the "anomalies". The last but not least important point is the scalability of 
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the algorithm that is fundamental in many practical application domains. In particular, 
this last point usually depends on the complexity of the language used to express the 
patterns and on the computational efficiency of the score function. For this reason it 
is necessary to consider a compromise between the expressivity and the computational 
efficiency of languages and score functions. 

1.2.3 Classification 

In recent years, several algorithms have been developed for classification, but all allow, 
albeit with different techniques, to match a set of elements defined over a space of 
features, with a set of labels corresponding to different groups or classes [24]. This is 
equivalent to partition the space of features into regions, assigning to each region a 
specific label. In general, classification refers to the class of methodologies of machine 
learning that given in input a set of data assign subparts of the input data to a given 
class taken from a finite number of categories. More formally, let's consider a set of 
observations X G W 1 , a set of elements Y = yx, ■ ■ ■ ,yu called labels and a function 
/ : X — > Y that defines the true mapping from the set X of observations to the set 
of labels. A classification algorithms considering a set D = (x\, y±), . . . , (x n , y n ) called 
training set produce in output a function g : X — > Y that approximate as close as 
possible the function /. The classification can also be seen as a problem of parameter 
estimation, where the goal is to estimate a set of functions of the form: 

P {class\x) = f (x; T) (1.9) 

where x G X represents the vector of input features for each item to be classified, 
and / is a function depending on a vector of parameters denoted by related to the 
specific classification problem. This function represents the probability that the element 
represented by the vector of characteristics, belongs to a particular class. In any case, 
the classification process generally follows the following steps: 

1. Selection of the classes of interest; 

2. Selection of the set of training; 

3. Statistical analysis of the set of training in order to assess whether they represent 
well the problem being tackled; 

4. Algorithm Selection for classification; 

5. Classification of data using the chosen algorithm; 

6. Validation of the results and their interpretation. 
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The most common algorithms to perform classification are: Bayesian Classifier, PC- 
Nearest Neighbors, Support Vector Machine, Decision Tree and Neural Networks. The 
interested readers can found a good survey of the principal classification algorithms 
here [24]. 



Chapter 2 

Multi Layer Analysis 



In this chapter a detailed and formal description of the Multi Layer Analysis (MLA) will 
be presented. The MLA is a general feature extraction method that can be adapted to 
discover patterns on one-dimensional signals or as a preprocessing step to classification, 
clustering and other data analysis techniques. 

2.1 The Multi Layer Analysis 

The MLA is a feature extraction method in which the processed input data can be used 
by a classifier or a clustering method in order to distinguish between several kinds of 
patterns. It is based on the generation of several sub-samples of the input signal, each 
one carried out by a particular threshold operation, chosen by respecting cut-set optimal 
conditions, within respect to the input data. In figure 2.1, it is shown a flowchart of the 
whole methodology. As it is possible to see in that figure, the method starting from the 
input signal and applying a set of simple operations, called thresholds, extracts a set of 
intervals. These intervals opportunely aggregated can encode the shape information of 
the input signal that can be used to characterize it or to discover structures contained 
in it. In the following, the formal definition of the threshold operation will be given, 
together with some some generic application of this transformation. 

2.1.1 The threshold operation 

Definition Threshold operation 

Given an input signal / the threshold operation o~k is defined as follows: 

r ) - \ ^( x ) tf K/fc)) is true 
I k otherwise 

where p is a generic condition defined on the elements of /. 

In the simplest case / can be defined in R and it is possible to set: 

, , 1V [ true if f(x) < d> , 

P{f{x)) = { (2.1) 

false otherwise 



20 



Chapter 2. Multi Layer Analysis 













Threshold j 
operator I 


Intervals 
extractor 1 






/<±) 






Input signal 


^Threshold | 
operator 


[ Intervals 
1 extractor | 


Aggregation 
I Rule I 




Figure 2.1: Schema of ML A processing 

This approach detects sub-samples deriving from threshold operations that satisfy 
structural or shape properties. An example of a simple threshold operation with 
condition expressed in equation 2.1 is depicted in figure 2.2. 

The key idea behind the MLA is to explore the input signal at different threshold 
levels that corresponds to its decomposition into several sub-signals, in order to discover 
the hidden pattern of interest. 



Definition General MLA 

The MLA can be defined as a set of sub-samples of a one-dimensional signal / 

MLA(f) = {a 1 (x),a 2 (x),--- ,a K (x)} 



(2.2) 



where each threshold operation indicated by the subscript of a could be characterized 
by a specific condition. 

The MLA is more accurate and robust in comparison to a naive methodology that, using 
a single threshold operation could give inaccurate results especially in the real case when 
the input data is affected by noise. The accuracy and robustness are due the fact that 
MLA uses more conditions p in order to validate the same hypothesis or conditions 
on the multiple sub-samples extracted from the input signals /. For this reason this 
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Figure 2.2: Thresold operation for three different values of (j) 

technique introduces a sort of "flexibility" to the analysis of a signal. After the multiple 
threshold operations called horizontal sampling it is possible to extract a set of intervals 
from the original signal and define its interval representation; it is also possible to 
organize these intervals using a particular rule called aggregation rule. A summary of 
the overall process is shown in figure 2.1. The next two subparagraph explain in details 
the horizontal sampling, the interval representation and the aggregation rule of a signal. 

2.1.2 The Horizontal Sampling, the Intervals Representation and the 
Aggregation Rule 

The core of the MLA is the interval identification obtained through the horizontal 
sampling procedure. 

Definition Horizontal sampling 

Given a bounded signal / : [a, /3] — > M + and K G N threshold operations a^ (k = 

1, ...,K) for each k it is possible to build a set of intervals: 



Vkiiki 



■ hi. 



} 



(2.3) 



where i\ 



[a^bl] with* 



1, • • • , nk, and at, b{ G R 



In the simple case in which the condition p of the generic threshold operation a^ is 
that expressed in equation 2.1 it is easy to prove that f{a k ) = f(b\) — tk- After 
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the horizontal sampling process, a different representation of the input signal, called 
Interval representation of / is drawn and it will be denoted with T(/). 

Definition Disambiguation operation 

To avoid ambiguities in the case / is discrete i.e. / : {1, 2, • • • , L} — > R + , and /(l) ^ 

min(f) or f(L) ^ min(f), f is transformed into a new signal /' : [a, b] — > R + : 



where 



min(f) if x = a\J x = b 
f{x) if 1 < x < L 



if /(l) ^ min(f) 

1 otherwise 



and 



ft= f L + l if/(L)^mm(/) 

L otherwise 

Definition Interval Representation 

Given a signal / and i^ threshold operations o^ (k = 1,...,K), and let Ik = 
{*fc'*l'"'' '^fc fe } ^ ne se ^ °^ intervals corresponding to a^, then the interval represen- 
tation of / indicated as T(/) is: 

T(f) = {h,I 2j ... ,I K } (2.4) 

Definition Aggregation Rule 

Given a signal / and its interval representation T(/) = {I±, I 2 , ■ ■ ■ , Ik} an aggregation 
rule is a rule that constructs sets of intervals taken from T(/) in order to character- 
ize or represent "interesting" subparts of /. In general it is possible to define several 
aggregation rules to express different shape properties present in a signal. In the next 
chapters it will be presented several examples of aggregation rules applied to different 
application domains. 

Definition Equally spaced simple MLA 

Without loss of generality, let assume that / : R — > [0, 1] and K > 2. The equally 
spaced simple MLA is carried out by considering the thresholds a^ with 1 < k < K 
defined as follow: 

r f(x) * /(*)<& 

o-k{x,(pk) = < , 

q>k otherwise 
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with (j>f. = -g* x (k — 1) 

As convention the first threshold operation corresponds to o"i(rr,0) and the last to 
ctk(x, 1)- Note that all the intervals extracted by the last threshold operation ctk by 
convention encompass a single point corresponding to the intersection of the signal with 
the straight line of equation: y = 1. In other words, these intervals Ik have the property 
that a K = b K , VI < t < K. In addition, by definition the first threshold operation 
collects only one intervals [1, L] where L = (3 + 1. An example of equally spaced simple 
MLA is depicted in figure 2.3. 




Figure 2.3: Equally spaced simple MLA 

In general the interval representation is lossy because it can only keep a subset of 
points of / that form the intervals in T(/) (see figure 2.4). 




Figure 2.4: Interval representation of a signal 
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Notice that as many other transformations presented in chapter 1, by using the 
MLA it is always possible to reconstruct a lossless version of the input signal if some 
conditions arise, and this will be discussed later. Obviously, the information loss in 
this representation decreases as the number K of threshold operations increases. Of 
course, it is always possible to reconstruct a lossy version of the original signal using 
an interpolation algorithm and using only the points of its interval representation. 
Given a generic signal / it is also obvious that it is always possible to obtain a lossless 
reconstruction of / from its representation T(/) as k — > oo. If / is a discrete signal, it 
is easy to prove that is always possible to obtain a lossless representation imposing that 
at least one of the threshold levels intersect each point of /, in particular the following 
theorem gives a way to calculate the minimum number of thresholds operations K to 
use in order to build a lossless representation using equally spaced thresholds. 

Theorem 2.1.1 Let £ m in be the "precision required, and let f : [a,f3] — > [0,1] be a 
discrete time signal of length L (\[a,/3]\ = L). Then the lower bound of threshold 
operations K allowing a lossless representation h of f using the equally spaced simple 
MLA (i.e. for each pair of adjacent point in h, d n = \h(n + 1) — h(n)\ >= c with c G M.) 
is: 



K 



L-l 

IE 



n=l 



d. 



1 



9 •*■ £-min 



(2.5) 



with g the GCD (Greatest Common Divisor) between all the integers: F 






n 



1.2,--- ,L 



}■ 



Proof Using a precision of e m j n it is possible to map the set of the absolute differences 
D = {d n = \f(n + 1) — f(n)\, n = 1, 2, • • • , L} in the set of natural numbers F = 



dn 



n = l,2, 



, L > and let g = GCD(F). By definition of g it results that 

dn 



with m n £ N, and 



g xm n 



L-l L-l 

^—i ^ — 1 y 



n=l 



n=l 



Lemma 2.1.2 Let e m i n the precision required, and let f a discrete signal of length L 
and without loss of generality let us assume that f as values in [0, 1] . Then 



L-l 



* = £ 



d n 



(2.6) 



n=l 
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Number of threshold operation K 


Kendall Correlation 


Length of representation 


2 


0.6900 


4 


4 


0.2846 


68 


8 


0.9420 


130 


16 


0.9973 


280 


32 


0.9987 


566 


64 


0.9999 


1440 



Table 2.1: Degradation of the signal for different values of K 



is the upper bound on the number of threshold operations K to obtain a lossless repre- 
sentation of f using an equally spaced subdivision of f . 

Proof The proof is straightforward, it is possible to obtain the largest K when the 
GCD g assume its minimum value, this value is 1 because one property of GCD is that 

Although the previous theorem and lemma show a lower and upper bound on K 
allowing a lossless representation of a discrete signal /, it is usually convenient for 
several reasons to optimize the search for the best smallest K allowing a reasonable 
lossy representation of /. It is obvious that the number of threshold operations strongly 
depends on the signal shape. For this reason, this representation is suggested when the 
information of the signal is encoded in the time space because it well characterizes the 
shape information (as a solution to this problem it could be possible to use the Fourier 
Transform and apply this methodology on the spectra of the signal). In figure 2.6 it 
is shown the progressive degradation of a signal as the number of threshold operations 
decreases, and in table 2.1 the number of points required to represent a signal giving a 
fixed level of K thresholds, and the correlation coefficient between the original and the 
reconstructed signal. In the subsection 2.2 a calibration procedure to select the proper 
value of K will be described. 

Note that this transformation cannot be simply related to the theory of sampling 
and in particular to the Sampling Theorem [64], because the non trivial distortion in 
the spectral components of the original signal that MLA could be introduce. 

Theorem 2.1.3 (Sampling Theorem [64]) If the highest frequency contained in an ana- 
log signals , x a (t) is F max = B and the signal is sampled at a rate F s > 2F max = 2B, 
then x a (t) can be exactly recovered from its sample values using the interpolation func- 
tion: 

sin(2irBt) 



g(t) 



2-KBt 



(2.7) 



26 



Chapter 2. Multi Layer Analysis 



In other words it does not exist a simple mathematical relation that link the two 
transformations because they extract different information from the signal, frequency 
and shape information as stressed before. As a enlighten example, consider two simple 
but opposite cases: a sinusoidal signal and a rectangular pulse signal. Looking at the 
figure 2.7 and 2.8 it is clear that this transformation introduces artifact on the spectrum 
for the simple sinusoidal signal, that can be represented only by one component with the 
Fourier Transform, but it is not present any artifact on the rectangular pulse signal that, 
in the continuous case, require infinite components to be represented properly in the 




00 120 140 160 



Figure 2.5: Original signal 






(a) Signal reconstructed with (b) Signal reconstructed with (c) Signal reconstructed with 
K = 3 K = 4 K = 8 







(d) Signal reconstructed with (e) Signal reconstructed with (f) Signal reconstructed with 
K = 16 K = 32 K = 64 



Figure 2.6: Degradation of the signal for different values of K 
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frequency domain. In other words the number of threshold operations doesn't depend 
directly on the frequency content of the input signal but only on the quantization levels 
needed to properly represent it. The quantization levels are obviously proportional to 
the smallest variation £ m i n that it is necessary to capture in the signal. If it is necessary 
to obtain in term of threshold operations an equally spaced "horizontally sampling" as 
in the case of equally spaced simple ML A it is possible to use the theorem 2.1.2. 




20 40 60 80 100 120 140 160 180 200 



Figure 2.7: MLA reconstruction of the simple sinusoidal signal with K = 8 



In some sense the MLA representation is related to the wavelet representation. In 
fact it is possible to think a signal as composed by scaled and shifted components (in 
sense of wavelet components) in which the mother wavelet is a single rectangle pulse 
as depicted in figure 2.9. The main difference with wavelet approach is that in MLA 
transformation the data are represented in a different way and the MLA "mother" 
doesn't need to have mean zero although it has finite duration. 
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Figure 2.8: ML A reconstruction of the rectangular pulse signal with K = 2 

2.2 Choosing the right value for the number of thresholds 

The bounds on the values of K given a quantization precision of e m in in the case of N 
equally spaced thresholds have been previously stated. An interesting question is: is 
it necessary to use all the levels that the upper bound stated in theorem 2.1.2? The 
short answer is no. A practical approach to follow, is to define a similarity measure 
between the original input signal and the reconstructed signal in order to have an 
idea on the "amount" of information that MLA representation induces. A set of natural 
similarity functions that can be suitable to this scope belongs to the family of correlation 
functions. Among the correlation functions, the most known are the Pearson, Spearman 
and Kendall correlation indices. 



Definition Pearson, Spearman, and Kendall correlation Given two signal x and y 
then the correlation indices are defined as: 
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Figure 2.9: MLA "mother" function. 



Pearson correlation 



YT=i( x i - x ) 2 T,?=i(vj - y) 2 



(2.8) 



Spearman correlation 



m a 2 



1 6E£iA 

n(n 2 — 1) 



(2.9) 



Kendall correlation 



n c -n d 
\n{n — 1) 



(2.10) 



where x = — ^ X{, y = — ^2 i Vi, A, is the difference between the ranks of X{ and yt, 
while n c and n^ are their number of concording and discording pairs, respectively. 
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signal / K 


5 


10 


50 


100 


earthquake 


0.3856 


0.6488 


0.9399 


0.9470 


gaussian 


0.9484 


0.9890 


0.9994 


1 


uniform 


0.9916 


0.9990 


1 


1 


sin 


0.9936 


0.9937 


0.9950 


0.9950 



Table 2.2: Information loss on the signal for different values of K 

In figure 2.10 it is possible to see four examples of real world and synthetic signals: 
an earthquake signal, a gaussian noise signal generated in accordance to the Gaussian 
distribution equation 2.12, a random uniform signal, generated in accordance to the 
uniform distribution equation 2.11, and a sinusoidal signal. 

Definition Uniform Distribution The uniform distribution [27] is a distribution that 
has constant probability over an interval [a, b], and its probability density function p is: 



p{x) 











for x < a 



j^ for a < x < b 



(2.11) 



for x > b 



Definition Normal or Gaussian Distribution The Normal or Gaussian distribution [27] 
is a probability distribution with probability density function: 



fix) 



1 



-x 2 /2 



2tt 



(2.12) 



Table 2.2 shows the number of levels required to obtain a correlation value of at least 
0.9 (using the Kendall's correlation, equation 2.10) in the case of four examples. It is 
also important to take into account the length of the signal representation that obviously 
strongly depends on the number of levels used. The following theorem gives an upper 
bound on the length of the representation of a signal using K threshold operations. 
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Theorem 2.2.1 Given a discrete signal f of length L > 3 and let K > 2 the number of 
threshold levels in the equally spaced simple MLA transformation then the upper bound 



Imax on the number of intervals of its representation T(/) is: 



m 



*{K-\) + \ 



and the real numbers required to represent the intervals are in number of: 



n max\-L->) — 2 * 



* (K - 1) + 2 



(2.13) 



(2.14) 



Proof To avoid confusion, remember that for definition the equally spaced simple MLA 
adds at the beginning (or to the end) of the signal / a point equal to min(f) if /(l) 7^ 
min(f) (or if f(L) 7^ min(f) ) by the disambiguation operation. It is possible to define 
two kinds of worst case signal, one for L odd (see figure 2.11 (a)), and one for L even 
(see figure 2.11 (b)). The even worst case signal involves always the addiction of a single 




If Vi 



(a) Eartquake signal 



(b) Gaussian noise 



1 



1 




(c) Uniform noise (d) Sinusoidal signal 

Figure 2.10: Different examples of signals (all of length 400) 
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new point by disambiguation, while two points are added in the case of a worst case odd 
signal. Moreover, the addition of a new point to the signal, involves the introduction 
of K — 1 new intervals as it possible to see in figure 2.12. Further it will be considered 
a generic threshold operation a^ with k 7^ 1 because by definition, the first threshold 
operation extracts always only one interval independently on the length of the signal. 
Note also that, in the case of the best case signal with L odd points, the number of 
interval is exactly I m i n (V) = [^J * [K — 1) + 1 (see figure 2.11 (c)). 
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Figure 2.11: (a) Odd worst case,(b) Even best and worst case,(c) Odd best case 
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Figure 2.12: Intervals increment: each point added can be add no more than k — 1 
intervals 

Let's recall two simple properties of the ceil and floor function: 



if L E N is even then: 



L-l 



(2.15) 



if L E N is odd then: 
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(2.16) 



Suppose to have a signal of length L, consider two cases, L even or odd: 



(L) even: Since L is even, only a new point has to be added. The resulting signal 
can be seen as the extension of the best case signal with L — 1 odd points by 
adding two new points, and applying the induction, and the properties 2.15, 2.16 
it results that I max (L) = I min {L-l) + {K-l)= [^\ * (K - 1) + 1 + (K - 1) = 
(L^J + 1) * {K - 1) + 1 = [■*=!] * {K - 1) + 1 = [f 1 *(K-1) + 1. 



(L) odd: Since L is odd, the worst case signal involve the addiction of two new 
points. The resulting signal can be seen as the extension of a best case signal with 
L odd points by adding two new points, and by applying the induction, and the 
property 2.16 it results that I max (L) = I min (L) + (K-1)= [^\ * (K - 1) + 1 + 
(K-1) = ([%\+1)*(K-1) + 1= \%) *(K-1) + 1. 



Lemma 2.2.2 Given a discrete signal f of length L and let K > 2 the number of thresh- 
old levels in the equally spaced simple MLA then the complexity of this transformation 
is 0(K * L) 
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Proof Using the previous theorem, it is clear that in the worst case it is possible to 
obtain ["g- ] intervals for a generic threshold operation. Since the transformation uses in 
total K threshold operations in the worst case it is possible to obtain 1^1 * K interval 
extractions. 

2.3 Usage of the MLA as preprocessing step 

In general it is possible to find two principal problems in which MLA can be successfully 
used: 

• given a family of signals and a signal in this family, characterize it in terms of the 
other signals in the family; 

• given a signal, discover if it contains interesting substructures in some formal 
sense. 

In more details given a signal / and its MLA representation T(/) there are several 
ways to use it, the most trivial is to use the intervals "as they are" in a feature vectors 
fashion. It is important to note that they are not real feature vectors since given two 
signals of equal length not always involve the same representation. In other words it 
is not possible to have a positional representation of the feature of a signal as in a 
classic feature vector. For this reason, in order to compare two or more signal using 
the MLA representation, special distances or more in general dissimilarity functions 
need to be defined. One way to overcome this problem is to use a set of probability 
distributions to model the output of threshold operations. It will be shown an example 
of this approach on chapter 4 where a randomness test that exploits this idea will 
be presented. If we need intstead to characterize subparts of a signal, it is necessary 
to define aggregation rules that reflect our "interestingness". It will be presented this 
approach in the next chapter where a rule that well characterizes a biological structure 
(the nucleosomes) will be defined. An extension of this approach will be presented in 
chapter 5 where a new structure using a particular intervals aggregation rule, called 
Tree Interval Representation, will be introduced. It will give also the possibility to 
define a new kernel function by taking inspiration from the well-known tree kernels 
that have been successfully used in a completely different context: the processing of 
natural languages and the text categorization. In particular each of these chapters will 
be organized in two parts: the first part will show the formal definitions and the second 
part will present the real problem and the proposed solution, highlighting where the 
MLA takes place and, if possible, a comparison with the state of the art methodologies. 



Chapter 3 

Pattern Discovery and 
Classification by MLA 



This chapter presents the MLA in the context of Pattern Discovery and Classification; in 
particular the section 3.1 explains how MLA can be integrated in these contexts. Then 
in section 3.3 a case study is introduced: it regards a particular biological problem, 
the nucleosome spacing, in which the MLA was successfully used (see section 3.1). 
In addition, in section 3.4, an alternative approach for this problem based on Hidden 
Markov Model is presented, while in section 3.6 a comparison of the two methods is 
presented. Finally, the last section is devoted to the description of a new one-class 
classifier that was used as new classifier module of the MLA. 

3.1 MLA in Pattern Discovery and Classification 

This section explains how it is possible to apply the MLA in the context of pattern 
discovery. A general schema of pattern discovery that takes advantage of the MLA is 
presented in figure 3.1. The important point here is that MLA plays the role of the 
language to express the pattern as it was explained in chapter 1. In particular, given a 
signal / the patterns correspond to subregions of / that can be found using its interval 
representation T(/) together with an appropriative aggregation rule. In particular as 
expressed in chapter 2 it is convenient to use the MLA in order to characterize or 
discover patterns in term of their shapes. This means that a general criteria to assess 
if a pattern is interesting into this context, is to check how close a subregion of a signal 
expressed in term of intervals meets a particular aggregation rule criteria or intervals 
distribution. In the latter case this means that it is possible to define an expected 
intervals distribution for a "background" that can be used to assesse how interesting a 
pattern is. This approach, as it will be shown in a case study described in the next 
section, is particularly useful and natural for signal segmentation. 

In the classification problem, since it is necessary to provide an explicit training 
set (i.e. some examples for each class to discriminate), the MLA can be used as 
feature extractor, in the sense that each element can be expressed using MLA as 
its interval representation, or more in general in a structure built on its interval 
representation using a particular aggregation rule. Here, an element of a class can be a 
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Figure 3.1: Pattern Discovery by ML A and signal segmentation 

whole signal or a subpart of a signal maybe extracted with a pattern discovery approach. 

In the next section, the basic biological notions will be provided in order to in- 
troduce the MLA in the context of pattern discovery and classification for a particular 
biological problem: the nucleosome spacing. 

3.2 Fundamentals of Molecular Biology 

In this section some concepts and notions of biology will be described, in order to 
introduce the basic terminology that can be useful for the comprehension of the matter. 

3.2.1 DNA 

DNA is a double helix molecule formed by two chains (helices) oriented in opposite 
directions, as shown in the figure 3.2. DNA is present in every cell in the body and con- 
tains all the genetic information necessary for the body. The major classes of organisms 
are eukaryotes and prokaryotes. In eukaryotes DNA is contained within the nucleus, 
separated from the cytoplasm; in prokaryotes, instead, it is contained in cytoplasm. 
DNA is composed of four distinct types of bases, called nucleotides, that consist of 
three parts: a phosphate group, a sugar (deoxyribose) and a nitrogenous base (purine 
or pyrimidine). The four bases that forms the DNA are: adenine (indicated by A), 
cytosine (indicated by C), thymine (indicated by T) and guanine (indicated by G). The 
DNA bases are complementary: a C always pairs with a G and an A with a T. The 
complementarity of the two chains allows to represent a DNA sequence using only one 
of the two because the other one is complementary and then the information it contains 
is redundant. 



3.2.2 Genes and proteins 

Genes correspond to particular sub-sequences of DNA. They belong to the genome of 
an organism, which can be composed of DNA or RNA; the genes in particular direct 
physical and behavioral development of the body. Genes also determine the amino 
acid sequence of proteins, which are the most involved macromolecules in biochemical 
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Figure 3.2: DNA structure 

and metabolic processes of the cell. Some other genes do not encode proteins but 
encode RNA that plays a key role in gene expression. In a cell there are thousands of 
different proteins, each with a distinct amino acid sequence. In particular each amino 
acid is encoded by exactly 3 nucleotides as it is possible to see in figure 3.3 and there 
are 20 amino acids in total. In general, a protein is a polymer composed by different 
combinations of amino acids that bind each other through some interactions that are 
called peptide bonds. Proteins play a variety of tasks in the cell. In fact, they transmit 
messages between cells, turn on and off genes, are essential in muscle contraction, and 
finally build structures such as hair. Proteins are characterized by a three-dimensional 
structure articulated on four structural levels, in relation to each other: 



1. The primary structure is the one that identifies the specific sequence of amino 
acids from the peptide chain. 

2. The secondary structure corresponds to several configurations such as the spiral 
shape (or alpha helix), the planar (or beta sheet), the three intertwined filaments 
and those belonging to the globular KEMF (keratin, epidermina, myosin, fibrino- 
gen). 

3. The tertiary structure represents the three-dimensional configuration of the 
polypeptide chain. This configuration is permitted and maintained by different 
chemical bonds, including the sulfide bridges and the forces of Van der Waals. 

4. The quaternary structure determines the association of two or more polypeptide 
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units, or of protein and non-protein units, joined together by weak bonds, such as 
sulfide bridges, but in a very specific way, such as it occurs in the formation of the 
enzyme phosphorylase, consisting of four sub-units, or from hemoglobin, which is 
the molecule responsible for transporting oxygen in the body. 
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Figure 3.3: Amino acids alphabet in terms of DNA alphabet 



3.2.3 Protein production and expression level of a gene 

The production of a protein from a gene is called gene expression. To obtain a protein 
from a gene, the information in DNA is copied through a process called RNA transcrip- 
tion. RNA in the form of mRNA acts as a messenger and delivers information from 
the cell nucleus (where DNA is located) to the cytoplasm. Once in the cytoplasm, the 
mRNA is translated in its product, the protein, thanks to the usage of the alphabet of 
amino acids. Then the protein is built starting from the original DNA sequence repre- 
senting the gene, as it possible to see in figure 3.4. Each cell of an organism contains the 
same DNA, so the same information; however cells are specialized according to their 
function. This specialization is because not all genes are expressed at the same time 
and within the same cell. In fact, gene expression is a controlled dynamic phenomenon 
so that the processes of a cell are carried out in a controlled way. This phenomenon 
is regulated by several proteins that bind each other different regions of DNA. This 
adjustment may depend on the function that a cell has to make and it is regulated by 
both external factors and internal factors produced by the cell. 
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Figure 3.4: From a genomic sequence to a protein 



3.2.4 Nucleosome and chromatin 

As said before, DNA contains all the information of an organism and it is organized 
in a specific space configuration called chromatin and in particular in chromosomes. 
More in detail, there are fundamental units called nucleosomes that package DNA into 
chromatin and there are several levels of space organization from DNA to a chromosome 
as it is possible to see in figure 3.5. The nucleosome, whose discovery dates back to 1974, 
is the fundamental unit of chromatin structure and consists of a segment of about 150bp 
of DNA associated with a quaternary structure of proteins called histone octamer. The 
nucleosome has a compact globular shape and plays the role to compact DNA in a 
eukaryotic cell. In figure 3.6 it is possible to see the stylized structure of a nucleosome. 
Nucleosomes have a diameter of about 11 nm and are spaced from each other by a 
stretch of DNA linker varying in length from a few to about 80 pairs of nucleotides. 
The resulting structure has the characteristic appearance of a necklace of pearls and 
is the first level of compaction of chromatin. The formation of nucleosomes in fact 
converts a molecule of DNA in a strand of chromatin along about a third of the original 
length. This structural organization was highlighted after isolating the nucleosomes 
from chromatin. Several factors can influence the nucleosome organizations [72] and 
therefore the chromatin. Recent studies has shown that one of this factor is the sequence 
specificity that consists in the nucleosomes preference for some sequences: in particular, 
in vitro studies have shown that nucleosomes have a strong preference for some DNA 
sequences [70] and instead "don't like" other sequences such as poly (da,dt) tracts [71]. 
Another important factor is their statistical positioning [46]. This theory is based on 
the concept of barriers, that are regions on the dna in which the nucleosomes cannot 
stay. Barriers in particular on average regulates the positions of nucleosomes around 
them. An important result is that it is possible to derive mathematically the probability 
function on the preferences of nucleosome around the barrier. The last point is the set 
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of chromatin remodeler complexes that actively move the nucleosomes across DNA [66]. 
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Figure 3.5: From DNA to chromatin 



3.2.5 Microarray 

A DNA microarray (commonly known as gene chip, DNA chip, or biochip) is a collection 
of microscopic DNA probes attached to a solid surface such as glass, plastic or silicon 
chip forming an array [3]. These arrays are used to examine the expression profile 
of a gene or to identify the presence of a gene or of a short sequence on thousands 
(often the entire genome of an organism). Each location corresponds to a specific 
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Figure 3.6: Nucleosome structure: in blue the octamer, in orange the DNA 

gene (or a specific sequence) and it does contain multiple copies of a filament with a 
particular sequence of bases. These DNA strands are anchored to the surface of the 
substrate, and are used as probes to measure the amount of other DNA molecules (which 
are also single-stranded) derived from mRNA transcripts and contained in a solution 
that is deposited on the surface of the microarray. The main approaches used in the 
manufacturing process of the microarrays are two: one process is to deposit, with the 
help of a robot, a solution containing the DNA probes on the surface of the solid support. 
The probes can be made of a single-stranded cDNA (complementary DNA obtained by 
an mRNA transcript having a length of 200-2400 bases) or can be made of pre-chemically 
synthesized oligonucleotides (short sequences of nucleotides with a length of 50-100 
bases). Microarrays made by this process, are called "cDNA microarraies" [3]. The other 
process is to directly synthesize oligonucleotides on the surface of the microarray(in 
situ); this operation is carried out mainly with photolithographic techniques (typical of 
Affymetrix) and inkjet printing [3]. 

The advantage of using microarrays is the possibility to examine a large amount 
of data per experiment; for example, it is possible to monitor the expression levels of 
thousands of genes at a time. In the figure 3.7 it is possible to see the workflow that is 
usually followed when using the microarray technique: 

• Preparation and marking of the sample (different samples are labeled with differ- 
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Figure 3.7: Microarray workflow 



3.3 Case Study: Nucleosome Positioning 

The biological problem under consideration concerns the positioning of nucleosomes in 
DNA. This problem is very interesting because the accurate and precise measurement 
of the nucleosomes position on genomic scale could improve the understanding of the 
chromatin structure and its function. Alterations in chromatin and hence in nucleo- 
some organizations can result in a variety of diseases. In fact, the emergence of diseases 
is thought to be due to the fact that the altered chromosomes condensation leads to 
the expression increase of certain genes, causing abnormal production of proteins in 
the cell. This motivates the use of a methodology capable of determining the position 
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of nucleosomes, in order to study the implication of nucleosome spacing in the chro- 
matin condensation phenomena. This may be investigated by comparing the positions 
of nucleosomes in different contexts in which there are different amounts of proteins 
that remodel chromatin by changing their position. This would figure out which is the 
molecular basis of chromosome condensation defects or defects in gene expression caused 
by the partial or total absence of these molecular machines. In fact, it would be possible 
that the nuclesome spacing is the basis of this, which would mean that in the absence 
of such molecular machines, nucleosomes were not spaced properly carrying abnormal- 
ities in the cell. So it is very important to understand the processes that modulate the 
chromatin dynamics and in particular the nucleosome positioning. Their positioning in 
fact plays a direct role in gene regulation [51]. While the packaging that they provide 
allows the cell to organize a large and complex genome in the nucleus, they can also 
block the access of transcription factors and other proteins to DNA [17]. For exam- 
ple, under normal conditions the Pho5 promoter in yeast is occupied by well-positioned 
nucleosomes, preventing the transcription factor Pho4 from binding to its target bind- 
ing site. When induced by phosphate starvation, the nucleosomes are depleted from 
the promoter region so that Pho4 can bind to its target DNA binding sequence thus 
activating the Pho5 gene transcription [79]. However, nucleosome binding can some- 
times enhance transcription by bringing distant DNA regulatory elements together [84]. 
Genome-wide studies have found that transcription activity is inversely proportional to 
nucleosome depletion in promoter regions in general [5, 63, 47]. With the help of tiling 
arrays at 20bp resolution, Yuan et Al. [90] have looked at nucleosome occupancy rela- 
tive to gene regulatory regions on 4% of the yeast genome by using an Hidden Markov 
Model approach HMM. The used microarray-based method allows the identification of 
nucleosomal and linker DNA sequences on the basis of susceptibility of linker DNA to 
micrococcal nuclease. This method allows the representation of microarray data as a 
signal of green/red ratio values showing nucleosomes as peaks of about 150 base pairs 
long, surrounded by lower ratio values corresponding to linker regions. Consistent with 
previous studies, Yuan et Al. found that 87% of the transcription factor binding sites 
[33] are free of nucleosome binding. A substantial improvement over this work has been 
recently done by Lee et al. [48] where the genome-wide nucleosome positions in yeast 
have been mapped at 4bp resolution. A similar approach has also been used to look 
at differences in nucleosome spacing occurring in the absence of a chromatin remodeler 
[86]. A number of other groups have developed analysis methods to detect nucleosomes 
as well as transcription factor binding sites [10, 40, 45, 91, 43, 44, 55, 88]. Compared 
to transcription factors, it is more challenging to detect nucleosome positions since the 
majority of a eukaryotic genome is wrapped into nucleosomes. Another difficulty is that 
the raw data may contain complex trends that are unrelated to nucleosome binding [90]. 
An intuitive method to deconvolve data trend is to define a peak-to-trough difference 
measure and to detect its local maxima. However, Yuan et Al. [90] have found that al- 
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though this method can detect local peaks, it suffers from amplifying observation noise. 
A similar approach has been adapted in [60] to map nucleosome positions in human. 
Although an intrinsic DNA code for nucleosome positioning has been recently reported 
[69], a significant technological development in genome- wide location of nucleosomes 
has been made using "deep sequencing" approaches [2, 4, 56, 41], which differs from 
microarray-based approach in that the isolated DNA of interest is mapped to genome 
via direct DNA sequencing, instead of microarray hybridization. For this new tech- 
nology, the input data correspond to peaks of DNA fragment counts instead of high 
hybridization ratio. However, the task of peak detection remains a key problem for the 
statistical analysis of the input data. Unlike microarray-based approaches, where data 
collection is constraint to a regular grid, "deep sequencing" data are intrinsically base- 
pair resolution and therefore less statistically stable. One solution to this problem is to 
first map the data onto a regular grid by binning. However, more sophisticated meth- 
ods need to be developed to balance the resolution vs variance dilemma. The analysis 
of stochastic signals aims to both extract significant patterns from noisy background 
and to study their spatial relations (periodicity, long term variation, burst, etc.). The 
problem becomes more complex whenever the noise background is structured and un- 
known. Examples of such kind of data correspond to protein-sequences in the study of 
folding [21] and the positioning of nucleosomes along chromatin in the study of gene 
expression [90]. The analysis carried out in both cases has been based on probabilistic 
networks [39] (for example, Hidden Markov Models [26], Bayesian networks). Methods 
based on probabilistic networks are suitable for the analysis of such kind of signal data; 
however, they suffer of high computational complexity and results can be biased from 
locality that depends on the memory steps they use [90, 21]. In the next section it will 
be presented an approach that takes advantage of the MLA and its comparison with 
the proposed method based on HMM. The main advantage of MLA over HMM is its 
scalability that produce a significant reduction in computational time over the HMM. 
In this case study in particular it was considered the performances of these two methods 
to both synthetic and microarray-based nucleosome positioning data and their ability to 
recover distinct nucleosome configuration. This configurations could be underlie impor- 
tant regulatory roles, highlighting the impact of these methodologies on genome-wide 
nucleosome positioning studies in higher eukaryotes. 



3.3.1 The microarray and the signal 

The following describes the microarray structure designed and used in the Bauer Center 
laboratory for Genomics Research, Harvard University [90]. As mentioned before, a 
DNA microarray was used to extract the sequences corresponding to nucleosomes and 
those corresponding to the linker, in order to identify the nucleosomes on a genomic 
scale. In particular the microarray data, S, are organized in T contiguous fragments 
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S\,--- ,St which represents DNA sub-sequences. In order to obtain the signal on 
which subsequent processing are made, carrying out as follows is needed: Firstly, DNA 
wrapped in the nucleosome is isolated and labeled with a green fluorescent dye (it is 
marked the entire genomic DNA of the organism, chromatin is then digested with a 
particular enzyme that cuts in the linker regions of nucleosomes but leaves intact the 
DNA around the nucleosome). At the same time the genomic DNA is marked with a 
red fluorescent dye. At this point there is a competitive hybridization; if both probes 
are hybridized in equal proportions, a yellow spot will be obtained, while a red spot if 
the probe with the red marker is the more hybridized, otherwise a green spot. As a 
result red or green spots will be obtained as it is possible to see in figure 3.8. 
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Figure 3.8: Microarray probes 

In particular, in such data, each spot corresponds to a sequence of 50 base pairs. 
These sequences are overlapped of 30 base pairs in order to obtain a final resolution of 
20 base pairs. With this resolution a nucleosome, which occupies about 150 base pairs, 
will correspond to about 6-8 probes in the microarray. These nucleosomes are called 
well-positioned nucleosomes. There is also a class of decentralized nucleosomes, that 
can occupy multiple positions due to thermodynamics factors or that can correspond 
to segments that may come from cells in different states. The next step is to excite 
the two dyes with a laser scanner, using different wavelengths; in this way a separate 
scanning of red and green channels is obtained. To see if the sequences are hybridized 
or not, their logarithmic ratio has to be considered: 

cr 

R 



S = log 2 



(3.1) 



This will give a signal with a pattern which will have peaks in the presence of nucle- 
osomes. An overview of this method and a fragment of this signal is shown in figure 



46 



Chapter 3. Pattern Discovery and Classification by ML A 



3.9 



v-g 



Qt? / ^§S- 







!EO^raJ-.'-l7J3'fEMl.I 



Figure 3.9: From microarray to one-dimensional signal 



3.3.2 Preprocessing 

Before the analysis, the signal coming from the microarray is normalized in order 
to remove possible measurement errors (bias) and to reduce the influence of cross- 
hybridization. Normalization is a two-step process: 

• the mean and variance of each group of spots is taken into account, 

• the cross-hybridization and the entropy of the signal (base sequence) is taken into 
account. 

The cross-hybridization is the hybridization of segments that do not have a per- 
fect match but only a partial one, and consequently do not match and should not be 
considered. The entropy here is intended the classic definition proposed by Shannon: 

k 
Ei = -^Pk logp/c (3.2) 

fc=i 

Where p^ represents the probability of emission of the k — th symbol, that is defined in 
the alphabet of the bases that constitute the DNA (A, T, C, G), and /j indicates the 
length of segments in each spot. 

The first phase of standardization will reduce the bias caused of different groups in 
which take place the hybridization. In particular this phase uses the following model: 



Vij = a i(^i +Pj) + £ 



(3.3) 
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where y^ represents the logarithmic ratio of the observed value of i — th probe of the 
j — th group, fii is the normalized value desired, (3j and (jj are respectively the mean 
and variance of the j — th group and s is an instrumental error term, which is assumed 
to be independent and have zero mean. 

In the second phase of standardization the objective is at least to reduce the effects 
of cross-hybridization, as this is considered unavoidable because of the large number of 
bases considered. In trying to reduce cross-hybridization two factors are considered: 

• A specific component that measures the number of small sequences that cross- 
hybridize with long overlaps with the sequences of the probes; 

• An unspecified component that measures the case in which a large number of 
sequences are weakly cross-hybridized with small overlaps with the sequences of 
the probes. 

The first component was modeled by a discrete value Bi, which is set to 1 if the 
sequence of a probe, (which as mentioned before is 50 bases long) corresponds to another 
sequence of equal length for at least 30 pairs of basis (a partial match, but not negligible), 
which would introduce an unwanted positive contribution to the signal of the logarithmic 
ratio. Otherwise, the value of Bi is set to 0. The second component was modeled with 
Ei, i.e. the entropy of the i — th sequence present in a probe. The normalized value v 
of the probe i of the group j is then obtained as: 

Vi = fH+ (w B m + qb)Bi + (wEfJ-i + qE)Ei (3.4) 

where wb Qb e we Qe are the linear coefficients estimated respectively for the first and 
second component, obtained by linear regression. 

3.4 First solution: Hidden Markov Model 

In this paragraph a formal definition of HMM will be outlined, and then a model 
topology designed for the particular biological problem of nucleosome identification 
will be given. 

The HMM is a statistical signal modeling technique used in various disciplines such 
as alignment of gene sequences, acoustic modeling, speech recognition and OCR tech- 
niques [25, 65, 9]. In this model, once defined the alphabet of symbols that make up 
the signal, a set of states are defined, each of one is associated with a particular prob- 
ability distribution to produce a particular symbol of the alphabet. It also necessary 
to define the probability of transition from one state to another, and the probability 
distribution of initial states. In this way this model leads to a weighted graph where 
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the edge weights represent the probability of transiting from one vertex to an adjacent 
one. The modeling of the signal can then be seen as a visit on this graph, where every 
time a vertex is visited, a symbol is produced. A formal definition of HMM will now 
be given. 

Definition Hidden Markov Model 

Let S an alphabet of M symbols. 

A HMM is a quintuple: A = (TV, M, A,B,ir) where: 

• TV is the number of states of the model indicated by the integers 1,2, . . ., TV; 

• M is the number of symbols of the alphabet that each state can produce or 
recognize; 

• A = (tty) is a matrix called transition matrix where aij represent the probability 

of transition from the state i to the state j with 1 < i, j < N. This matrix must 

also satisfies the following condition: ^Ojj = 1, Vi 

j 

• B is the probability distribution of the observations, where bj (a) represents the 
probability of recognizing or generating the symbol a £ £ if you are in the state 
j. In addition, The condition ^ bj (a) = 1, Vj needs to be met; 

• 7r ut is the probability distribution of initial states, where with 7Tj is denoted the 
probability of starting from the state i. In addition, the condition ^7Tj = 1, Vi 

i 

needs to be met; 

The transition matrix A induces a directed graph where nodes represent states, and arcs 
are labeled with their corresponding transition probabilities. The term hidden refers to 
the fact that, given a sequence of symbols that composes the signal you want to model, 
and set a model, the sequence of states is hidden and not unique, unlike other models 
such as Markov Chains [12] for example. 

The HMMs can be used, as it will be shown in the following paragraphs, both as 
generators and as recognizers of signals. 

3.4.1 HMM as generators 

A HMM can be used to generate a sequence of S*. Let X = x\X2 ■ ■ ■ xt £ S*. This 
sequence can be generated by a sequence of states Q = q\qi . . . qr as follows: 

1. Set i <— 1 and choose the state qi according to the probability distribution ir of 
initial states; 
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2. Assuming to be in the state g, (having already generated x±X2 ■ ■ -Xi-i) produce 
in output Xi according to the probability distribution bi ; 

3. If i < T, then i <— i + 1 and go to the state %+! in agreement with A [i, 1 : N] 
and repeat step 2 otherwise end. 

The probability of observing X = x\Xi . . . xt and the sequence of states Q = 
qiq 2 ■■■qr is: 

T 

P(X,Q)=7r 1 Y[b i (xi)a ii+1 (3.5) 

i=i 
This probability is often not very useful because it is unknown which sequence of 
states has produced the string X (since it is possible to have multiple sequences of 
states that can generate it). Algorithms that solve this problem will be shown later. 

3.4.2 HMM as recognizers 

A HMM can be used as a probabilistic validator of a sequence of S* because it returns a 
measure, in terms of mass of the probability of how well a HMM recognizes or observes 
X. This probability is defined as: 

T N 

p(x\x)= n r,p(qt=i)bi(x t ) 

t=l i=l 



with P(q t =j) 



itj if t = 1 (3.6) 

JV 
Yl P (qt-i = i) a-ijbi (x t -i) 



=i 

As mentioned earlier, the HMM through the transition matrix A induces a multi- 
parted graph. This graph can be represented as a matrix with A^ rows, which correspond 
to N states of A, and for all t > 1 columns t and t + 1 form a complete bipartite graph, 
with arcs directed from vertices in column t to vertices in column £+l(l<£<T— 1). 
The recognition consists of superimpose X over all possible paths of length T in this 
graph (which is called trellis), starting from the vertices in column 1. For a given vertex 
i in column £ on a given path, the measure of how well it is possible to recognize the 
symbol xt consists of two parts: the probability of being in the state P(qt = i) and the 
probability that the state emits the symbol Xt given by bi{xt). 

3.4.3 Problems related to HMM 

Given an HMM model A, three main issues are considered: 

1. Given a sequence of observations X = x\Xi-..xt G S* and a model A = 
(N, M, A, B,tt), calculate the probability of observing the sequence X using the 
model A i.e. P(X\X); 
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2. Given a sequence of observations X = X\x 2 . . . xt £ X* and a model A = 
(N, M, A, B, 7r), choose the corresponding sequence of states Q = q\q 2 ■ ■ ■ qT that 
best explains the observations using the model A and an optimization criterion; 

3. Calculate the values of model parameters A = (N, M, A, B, n) in order to maxi- 
mize P(X\X). 

The first problem is solved efficiently by an algorithm called forward procedures, the 
second by the Viterbi algorithm, while the third by the Baum Welch algorithm. 

3.4.4 Forward procedures 

By using this algorithm, is possible to calculate P(X\X) in 0(N x T x 6 max) where 
5 max is the maximum degree among all HMM states. This algorithm uses dynamic 
programming and consider a variable at(i) defined as: 

a t (i) = P(xix 2 . . -x t ,qt = i\\) (3.7) 

that is the probability that at time t, it is possible to observe the partial sequence 
X1X2 . . . Xt and reach the state i. The procedure consists of three phases: 



Initialization: 



Induction: 



ai (i) = TTtbi (xi) with 1 < i < N (31 



Termination: 



"m (J) = ( E a t (*) a ij ) b j ( x t+i, 

with 
1 < t < T - 1, l<j<N 



N 



(3.9) 



P(X\X) = ^2a T (i) (3.10) 



i=l 

In figure 3.10 the single steps that allow to calculate at+l (j) are shown. The 
number of possible paths grows exponentially with the length of the sequence, so it 
is not possible, in many applications, to consider all paths. For this reason a good 
approximation is to consider only the probability of the most likely path. There is 
also a variant of this algorithm that, at the end of computation, calculates the same 
probability starting from the possible terminal states used to recognize (or generate) 
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«t+i(0 

Figure 3.10: Forward procedure 

the sequence X. This variant, which is called the backward procedures, as well as the 
forward procedure, uses a variable /3j (i) defined as: 

Pt (i) = P (xt+iXt+2 ■ ■ ■ x T \qt = i, A) (3.11) 

that represents the probability at time t, to observe a partial sequence from time £ + 1 
until the end, being in the state i under the assumption of the model A. In figure 3.11 
the single steps that allow to calculate fit (i) are shown. 

3.4.5 Viterbi algorithm 

The Viterbri algorithm provides an efficient solution to the second problem of HMM i.e. 
computing the optimal sequence of states for the recognition of the sequence X with 
the model A. The term "optimum" depends on the particular problem taken in exam. 
In any case, one of the most used criteria is to find the best sequence of states that 
generates X maximizing P (Q\X, A) or equivalently P (Q, X\X) . The Viterbi algorithm 
uses dynamic programming and computes: 

• (3t (i) = max P (qiq2 ■ ■ ■ qt-i, qt = h x\x% ■ ■ ■ Xt\X) i.e. the probability of the 

q\q2...qt-l 

most likely path that takes into account of the first t observations and that ends 
in state i; 

• 7t (i) that represents the state that leads to the state i at time t. 
The procedure consists of four phases: 
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A(0 



1. Initialization: 



Figure 3.11: Backward procedure 



Pi (}) = Kibi (xi) 
71 (*) = 



with 1 < i < N 



(3.12) 



2. Induction: 



Pt (j) = max {Pt-i (i) ciij} bj (x t ) with2 <t <T 

l<i<N 

7 t (j) = arg max {/3 t _i (i) a^}withl < j < N 

Ki<N 



(3.13) 



3. Termination: 



P(Q\X,X)=m^ N {p T (i)} 

q T = argmax{/3 T (i)} 
Ki<N 



(3.14) 



4. Backtracing: 



q t = 7 m (t + 1) , t = T - 1, . . . , 1 



(3.15) 



This algorithm has a computational cost equivalent to O (N x T x 5) where a 
represents the maximum degree of the graph induced by the transition matrix of A. 
Again, as in the forward procedure, the number of possible paths grows exponentially 
with the length of the sequence, making this method not always feasible in the case of 
large amounts of data. 
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3.4.6 Baum Welch algorithm 

The calculation of the values of model parameters A = (JV, M, A, B,tt) that maximize 
P(X\X), is not an easy task. In fact, there isn't any analytical method that solves 
the problem by maximizing the probability of observing the sequence: given a finite 
sequence as a training set, there isn't a perfect way to estimate the parameters of the 
model. However, it is possible to derive a model A = (N, M, A, B,ir) so that P(X\X) 
is locally maximized using an iterative procedure. The best-known iterative procedure 
that solves this problem is the Baum Welch algorithm. To describe how this algorithm 
works first define this function: 



€t(i,j) = P(qt =i,qt+i =J\X,X) 



(3.16) 



i.e. the probability of being in state i at time t and in state j at time t + 1, given 
the model and the sequence of observations X. The sequence of events leading to the 
conditions required by this variable is shown in the figure 3.12. 




t-1 



a ij ty ( x t+l) 



t+1 




t + 2 



Figure 3.12: Baum Welch algorithm 

Obviously, it is clear that looking at the definition of the variables used in the 
procedures of backward and forward, it is possible to rewrite: 



St(i,j) 



o:t(i)aij6j(a:t+i)ft+i(j) 
P(X\\) 
at{i)aijbj(x t +i)Pt+i(j) 

N N 



(3.17) 



Where the numerator is simply the probability P (qt = i, qt+i = j, X\X) . Previously 
at (i) was defined as the probability of being in state i at time t, by observing the partial 
sequence X\X2 ■ ■ ■ Xt- Let's see how at (i) can be defined in terms of £t (i,j) '■ 
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N 

<*t(i) = 52tt(ij) (3-18) 

i=i 

Summing over t the functions at (i) and ^ (i,j) it is possible to obtain: 

T-l 

y at (i,j) = number of expected transitions from state i (3.19) 

t=\ 

T-l 

/ £t (hj) = expected number of transitions between state i and state j (3.20) 

i=l 

Using the defined formulas will be shown now the method for estimating parameters 
for a HMM using the Baum Welch procedure. 
Reasonable estimates for the parameters are: 

Wi = expected number of times to being in state i at time (t = 1) = a± (i) (3.21) 



T-l 

t= i expected number of transitions from state i to state j 

T-l expected number of transition from state Si 

22 a t (i,j) 

t=i 



T-l 

E <h(j) 

t=lAx+=Vu 



bj (k) - -^ r - 
t=i 

expected number of times of being in the state j and observing the simbol v k 
expected number of times of being in the state j 

(3.23) 
these equations can be used in order to develop an iterative process that, starting 
from a model A = (N,M,A,B,ir), allows us to estimate at each step a new model 
A= (N,M,A,B,W). 

In addition it can be proven that: 

• The model A represents a critical point of the likelihood function in the case A = A; 

• The model A is better than the model A, which means that the probability of 
observing X given the model A is greater than the probability of observing X 
given the model A i.e P (X\\) > P (X\\) . 
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These two statements tell us that this procedure converges to a critical point. This 
can be done using iteratively the model A instead of A and repeating the process of 
parameters estimating, gradually increasing the likelihood of the observations of the 
training sequence, until a critical point is reached. The end result of this procedure is 
called the maximum likelihood estimate of a HMM. It is important to underline that this 
algorithm leads to a local maximum point, and in many real application the surface 
to optimize is very complex and has many local maxima. The formulas to estimate 
parameters can also be derived directly from the Blum's auxiliary function Q (A, A) in 
respect to A; this function is defined as: 

Q (A, A) = J2 P (Q\X, ^ log [P (X, Q\ A)] (3.24) 

Q 

It can be proven also that the maximization of the function increases the likelihood: 

max [Q (A, A)] => P (X\J) > P {X\X) (3.25) 

A 

3.4.7 The proposed HMM for nucleosome positioning 

As mentioned earlier in [90] the problem of identifying the nucleosome using data from a 
process of microarray hybridization and modeling observations with a particular HMM, 
was addressed. This is because a simple thresholding technique has not sufficient ac- 
curacy because of noise and trend in the data. The proposed model for the detection 
of nucleosomes in chromatin regions is shown in figure 3.13. In this model, several dif- 
ferent states for different types of nucleosomes with special connections are considered; 
in particular the states model the sequences of chromosomes corresponding to a linker 
(state L), well-positioned nucleosomes (states N±, A^, ..., Ng) and delocalized nucleo- 
somes (states DNi, DN2, ..., DNg). The values of the measures that can be observed 
by each state correspond to the physical values that the system outputs, which in this 
case represent the logarithmic ratio between the intensity of red and green for each 
spot of the microarray. The transition matrix that establishes which are the allowed 
transitions between states and their probabilities, is estimated with the Baum Welch 
algorithm together with the other parameters. In this model there is only one state that 
represent the class of probes corresponding to linker regions, and this state has a loop 
in order to model variable length linker regions. The number of states for the class of 
well-positioned nucleosomes in this model is 8. This choice is justified considering the 
length of a nucleosome in normal conditions (about 6-8 probe). In this way, the infor- 
mation about the expected length of a nucleosome is encoded in the model. Similarly, 
it is possible to note that the number of states for the class of delocalized nucleosomes 
in this model is 9 and the last state has a loop (similar to the state linker) in order to 
model the different lengths of nucleosomes regions that cover a number of probe greater 
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probabilif 



k. 



HMM 



J 



Figure 3.13: HMM topology for nucleosome positiong 

than 9. Finally, a well-positioned nucleosomes in this model have a length between 6 
and 8 probes, the delocalized nucleosomes have a number of probes equal to or greater 
than 9, and linkers have a variable length greater or equal to one. 



3.5 Second solution: MLA 

In this section the application of MLA to face the problem of identifying and classifying 
nucleosomes will be described. The following subsections will show the various steps 
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that allow the classification of the nucleosomes identified trough the MLA and the 
construction of a model for well-positioned nucleosomes. Firstly, let's recall that the 
signal S is divided into segments in which probes can be not contiguous (due to data 
referring to different regions of chromosomes, or missing data). In particular S is 
organized in T contiguous fragments Si, ■ ■ ■ ,St which represent DNA sub-sequences. 



3.5.1 Preprocessing 

In the first stage of processing a convolution process is applied in order to reduce the 
noise in the signal. The smoothing is done for each probe segment corresponding to 
adjacent regions of the signal i.e each fragment St, 1 < t < T of the input signal, 
S, is smoothed by a convolution operator that perform the weighted average of three 
consecutive signal values, where the weights are provided by the kernel window w = 
[J, J, J] [52]. 



3.5.2 Creating the model 

The construction of the model represents a phase of training, where it is possible to 
learn the shape of the pattern corresponding to the nucleosome considering only the 
regions that corresponds with high probability to well-positioned nucleosomes. Since 
well-positioned nucleosomes are shown as peaks of a bell shaped curve, in order to locate 
the position of a nucleosome, all local maxima of the input signal are automatically 
extracted from the convolved signal X of S. Then a subset of maxima are opportunely 
selected for the model definition. Each convolved fragment X% is processed in order 
to find L(Xt) local maxima M 4 for I = 1, • • • ,L(Xt). The extraction of each sub- 
fragment for each M t is performed by assigning all values in a window of radius os 
centered in M{ ' to a vector, FJ of size 2 x os + 1: FJ(j) = X t (MJ: ' — os + j — 1), for 
j = 1, 2, ..., 2 x os + 1. The selection process extracts the significant sub- fragments to 
be used in the model definition. This is performed by satisfying the following rule: 



'Fl(j + 1)-Fj(j)>0 j = !,-■■ ,os 
Fl<j + 1) - Fj(j) < j = os + 1, ■ ■ ■ , 2 x os 



1 U (3.26) 



This condition is equivalent to verify that the signal in that fragment is increasing to 
the right of the maximum and descending to the left (condition of convexity). If the 
pattern respects this condition, it will be used for the next phase of construction of the 
model of the well-positioned nucleosome. The process continues in a similar way for the 
other points of relative maximum (if present) in the segment considered in descending 
order. After this selection process G(Xt) sub-fragments remain for each Xf. The model 
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of the interesting pattern is then defined by considering the following average: 

T , G(X t ) 



m 



Tj^ G(X t ) f^ Ft ^') j 



1. 



, 2 x os + 1 



(3.27) 



That is, for each j, the average value of all the sub- fragments satisfying Eq. 3.26. 
The model then will represent the average pattern of a well-positioned nucleosome 
through its expected shape. Applying this procedure a model shown in figure 3.15(a) 
is carried out averaging the pattern in figure 3.14(b). 



10r 




Figure 3.14: Patterns that meet the condition of convexity 



3.5.3 Interval identification 

This step is the core of the method i.e. the interval identification obtained by the 
Simply Equally spaced MLA presented in chapter 2. In particular by considering K 
threshold levels t^ (k = 1, ..., K) of the convolved signal X, for each tj. a set of intervals 



R k ~ {4' 1 !' 



r-n-k 
"'fe 



} is obtained, where, ij. 



[&£,4]andX( 



X(el) = t k . This 



set of intervals as explained in chapter 2 constitutes the interval representation T(X) of 
the input signal X. In Section 3.5.9 a calibration procedure to select the proper value 
of K is described. 
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Figure 3.15: Model of well-positioned nucleosome 



3.5.4 Aggregation rule and Pattern Definition 

This step is performed by taking into account that bell shaped pattern must be extracted 
for the classification phase. Such kind of patterns are characterized by sequences of 
intervals < ij, J?, 1; • • • , I^ +l > such that P- D P~t_i', more formally a pattern Pi is defined 
using the following aggregation rule: 



Pi = {I) 



ylj+l 
• 1 j + l ' 



fj + l 

± 3+l 



vr k k 



3!/ G R k+1 : I 



T-'fc+i (- pk' 
1 k+l 



CI^} 



(3.28) 



where, j defines the threshold, tj, of the widest interval of the pattern. From the 
previous definition it follows that Pi is build by adding an interval 1^+1 on ly if if is 
the unique in Rk+i that is included in it fc . Note that, this criterion is inspired by the 
consideration that a nucleosome is identified by bell shaped fragment of the signal, and 
the intersection of such fragment with horizontal threshold lines results on a sequence 
of nested intervals. In figure 3.16 two examples of shapes with the relative patterns are 
shown. 



3.5.5 Pattern selection 

In this step the interesting patterns p( m ) are selected following the criterium: 
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Figure 3.16: Two different shapes of the input signal: (on the left) Since at threshold 
level K + 1 the interval R^ = {/#-} has two subset Rk+i = {.Ir+i^k+i} i ^ i s possible 
to set three pattern Pi = {ij^}, Pi = {^ +1 } and P$ = {I^ +1 }- (on the right) In 
this case, l]^+i is the unique subset of ij^, thus it is possible to set an unique pattern 



p( m )={P, : \Pi\>m} 



(3.29) 



i.e. patterns containing intervals that persists at least for m increasing thresholds. This 
further selection criterion is related to the height of the shaped bell fragment, in fact a 
small value of m could represents noise rather than nucleosomes. The value m is said 
the minimum number of permanences; in subsection 3.5.9 a calibration procedure to 
estimate the best value of m is described. 

3.5.6 Feature extraction 

Each pattern P { £ p( m ) is identified by ijVJ+iV" ,lj+/, with I > m. Straightfor- 
wardly, the feature vector of Pi is a 2 x / matrix where each column represents the lower 
and upper limits of each interval from the lower threshold j to the upper threshold 
j + 1. The representation in this multi-dimensional feature space is used to characterize 
different types of patterns. 

3.5.7 Dissimilarity function 

A dissimilarity function between patterns is defined in order to characterize their shape: 



S(P r ,P s ) = (1 - a)(A r - A s ) + aY J K l - a?) 



(3.30) 



iai 



where, A r and A s are the surfaces of the two polygons bounded by the set of vertexes 
y = Uie/{(fr?' e D' (kiS e i')}> a 7 = e 7 ~K l i °? = e i* - &i*> an d « is a user parameter 



ranging in the interval [0, 1] to set the weight of the two dissimilarity components. 

The first component of this dissimilarity allow us to consider patterns of close dimen- 
sions, while the second component has been introduced to include shape information 
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since it can be considered a correlation measure of the two bounding polygons. This 
dissimilarity can be used by a general classifier in order to distinguish the kind of pat- 
tern. An example of input signal and the extracted interesting patterns is given in 
figure 3.17. 

3.5.8 Nucleosome Classification 

With the MLA, one is able to classify four "refined nucleosomal states": linkers, well- 
positioned, delocalized and fused nucleosomes. (see figure 3.18). In the following, the 
classification rules which allow us to automatically discriminate such kind of patterns 
are stated. The classification was conducted in two steps, in the first step the linker pat- 
terns, the expected well-positioned patterns and expected delocalized patterns are found. 
Afterwards, the ranges of the regions representing the expected well-positioned and 
delocalized nucleosomal patterns are set, defining the expected regions. Finally, the 
classification is performed by testing the intersection of such regions (see figure 3.19). 
First phase: 

For each interesting pattern Pi, the dissimilarity 5{Pi,F) is evaluated (5 is defined in 
Eq. 3.30, F is the model), the rule to classify Pi is : 



Cl(Pi) 



L 


XSiPi^Kfa 


EW 


if 0i < 6(Pi,F) < fa 


ED 


otherwise 



(3.31) 



where L means linker pattern, EW or ED are nucleosomal pattern, and in particular 
expected well-positioned patterns and expected delocalized patterns respectively. 

Second phase: Afterwards, for each expected well-positioned nucleosomal pattern 
Pi = {I- j , I/Zi , ■ ■ ■ , IJ±i} ( e -g- Ci(-Pj) = EW), the center of the nucleosomal region Ci 
is calculated: 

i -i +l p i _i_ yi 
a = y£^P (3.32) 

k=j 

which represents the mean of the first I intervals defining the pattern Pj. Conversely, 
for each expected delocalized nucleosomal pattern (e.g. ci(Pj) = ED), the delocalized 
interval [B l ,E l ] is defined such that: 

1 i+a/2) i+(i/2) 

Bi= m E ^andP* = — Y, 4 (3-33) 

1 k=j ' k=j 

Note that, B % and E l represent respectively the mean of the first 1/2 beginning and 
ending of each interval belonging to the pattern Pj. The expected regions is so defined: 
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A = UCi(l)-3,Ci(l) + 3\ iic 1 (P i ) = EW 
[[&,&] otherwise 

In particular, each expected region Ai is, in the case Pi is an expected well-positioned 
pattern, an interval with beginning 3 probes before and ending 3 probes after the center 
Cj, otherwise it is the interval [B l ,E 1 ]. Finally, the classification rule is: 



c 2 (Pi) 



F if AiDAj ^ j ^i otherwise 

W \{ Cl {Pi) = EW (3.35) 

D \i Cl {Pi) = ED 



where F, W and D stands for fused, well-positioned, delocalized nucleosomes respectively 
(see figure 3.18). Informally, the classification rule in Equation 3.35 assign the fused 
class if the expected nucleosomal regions overlap otherwise confirm the classification of 
the first phase. 

3.5.9 Parameter selection by calibration 

In order to set the proper values of K (number of thresholds), and m (the minimum 
number of permanences), a calibration procedure has been used. In particular, such 
values has been estimated by studying the plots of particular functions able to measure 
the goodness of several K and m. 

3.5.9.1 Estimation of m 

The minimum number of permanences m has been estimated by using the synthetic 
signal generator described above. This gives the opportunity to make a massive ex- 
perimental study on the relation between K and m. In particular, c = 10 copies at 
different signal to noise ratios j = 1,2,4 has been generated, resulting in a total of 3 x 10 
synthetic signals Vy. Once fixed a signal to noise ratio j, for each Vij the value of m 
which maximizes the recognition performances for several thresholds for k = 20, • • • ,50 
has been found. 

Figure 3.20 shows the results performed by considering c = 10 copies, three signal 
to noise ratio values 1,2,4, and k = 20, • • • ,50 thresholds. In each plot, the x axis 
represents the number of thresholds k (i.e. number of cuts), the column bar groups 
the best recognition and the percentage of minimum number of permanences which 
causes the best performances on all the 10 experiments. From this experimental study, 
it emerges that the use of an high number of thresholds can compromise the recognition 
process (due to the fact that an high value of K can capture also the noise present in 
the signal), moreover, the m value seems not dependent from K, and the one which 
causes the best recognition ranges in an interval of [0.15 x K, 0.30 x K]. 
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3.5.9.2 Estimation of K 

The proper value of K is estimated starting from the convolved input signal X. Giving 

a convoluted signal fragment Xt it is resampled it in the y direction resulting in several 

(k) 
samples X} for different threshold values k = 1, • • • ,K max . It is possible to measure 

the goodness of k by the average normalized correlation g(k) and the average missing 



probes MS{k) so defined: 



m l£ i±i M) (3 , 6) 



T ^ 2 



i T 

MS(k) = -Y,MS(k,t) (3.37) 



4=1 



In particular g(k) measures the average normalized correlation between each resamplc 
X\ and the generic fragment Xt (p is the correlation coefficient), while MS(k) the 
average of the missing probe values MS(k, t) due to the resample of Xt by k thresholds. 
Finally the value K is selected interactively by looking both at the plots of 7j and 
MS*, searching for the best compromise of maximum ~g and minimum MS (see figure 
3.21). In this way the signal obtained has an high correlation with the original signal 
and a reasonable number of missing samples in order to not capture the noise present 
in the signal. 

3.5.10 Synthetic generation of biological signals 

Before validating the MLA approach on biological data, a procedure to generate syn- 
thetic signal has been developed in order to assess the feasibility of the method on 
controlled data. Generated signals emulate the one coming from a tiling microarray 
where each spot represents a probe i of resolution r base pairs overlapping o base pairs 
with probe i + 1. In particular, the chromosome is spanned by moving a window (probe) 
i of width r base pairs from left to right, measuring both the percentage of mononu- 
cleosomal DNA Gi (green channel) and whole genomic DNA Ri (red channel) within 
such window, respecting also that two consecutive windows (probes) have an overlap 
of o base pairs. The resulting signal V(i) for each probe i is the logarithmic ratio of 
the green channel Gi to red channel R{. Intuitively, nucleosomes presence is related to 
peaks of V which correspond to higher logarithmic ratio values, while lower ratio val- 
ues shows nucleosome free regions called linker regions. This genomic tiling microarray 
approach takes inspiration from the work of Yuan et al. [90] where the authors have 
used the same methodology on the Saccharomyces cerevisiae DNA. Here it is defined a 
model able to generate such signals characterized by the following parameters: 

• nn: The number of nucleosomes to add to the synthetic signal. 
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• nl: The length of a nucleosome (in real case a nucleosome is 146 base pairs long) 

• A: Mean of the Poisson distribution used to model the expected distances between 
adjacent nucleosomes; 

• r: The resolution of a single microarray probe. 

• o: The length in base pairs of the overlapping zone between two consecutive 
probes. 

• nr: The number of spotted copies (replicates) of nucleosomal and genomic DNA 
on each probe of the microarray; 

• dp: The percentage of the delocalized nucleosomes over the total number of nu- 
cleosomes; 

• dr: The range which limits the derealization of a nucleosome in each copy of nr. 
It is defined in base pairs. 

• nsv: The variance of the green channel in each probe, even in absence of nucleo- 
somes due to the cross hybridization. This variance follows a normal distribution 
with mean 0.1. 

• pur: The percentage of DNA purification, which is the probability that each single 
DNA fragment of the nr copies appears in the microarray hybridization. 

• ra: Relative abundance between nucleosomal and genomic DNA. 

• SNR: The linear signal to noise ratio of the synthetic signal to generate. Note 
that the noise is assumed to be gaussian. 

Initially, a binary mask signal M is generated by considering as l's all the base pairs 
representing a nucleosome (the nucleosomal regions) and as 0's the regions represent- 
ing linkers (the linker regions). Note that, the beginning of each nucleosomal region is 
established by the Poisson distribution with mean A. The mask signal M will be used 
in order to validate the classification results. The red channel of the microarray (the 
genomic channel) results from the generation of nr replicates I±, ■ ■ ■ , I^ r each one start- 
ing from an initial nucleosomal region of random size b ~ ?7(0, r) (uniformly distributed 
in the range [0, r]), followed by continuous nucleosomic region of r base pairs. Con- 
versely, in order to simulate the green channel (the nucleosomic channel) nr replicates 
, Ij , • • • , l!^ r are considered, each one initially equal to M and subsequently modified 
by perturbing each starting points x l D of the nucleosome to consider as delocalized such 
that x l D = x l D + [i with random \x ~ U(dr). Note that the percentage of nucleosomes 
to consider as delocalized is established by the parameter dp. Afterwards, each nucleo- 
somal region on the generic replicate if 1 and if can be switched off depending on the 
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value of a random variable a ~ U(0, 1). Precisely, each nucleosomal region veryfing the 
test a < pur is considered and set to 1, otherwise it is not considered and set to 0. This 
results in new replicates T^ and T^ . Finally, the generated synthetic signal V for a 
probe i is so defined: 

V(i) = {log 2 (ZT=i ^P + s)\ (r - o)i - r + o + 1 < k < (r - o)i + o} (3.38) 
where e ~ iV(0.1, nsv). In figure 3.22 it is possible to see the steps of this process. 

3.6 Results 

The following experiments have been carried out by measuring the correspondence be- 
tween nucleosome and linker regions. In the case of the synthetic signal, the output of 
the classifier has been compared with a mask M' derived from M while in the case of 
the real data set it has been compared with the output of the HMM for nucleosome 
positioning (see section 3.4) optimally converted into a binary string. 
In all the experiments, the same value (<f>i,4>2) = (mean(5(FJ:,F)) — 
3std(5(F]:,F)),mean(5(Ft,F)) + 3std(5(FJ:,F)) has been considered, where Fj are all 
the sub- fragments used on the construction of the model F. Moreover, by biological 
consideration, the radius os has been set to os = 4. The performances have been eval- 
uated in terms of Recognition Accuracy, RA. The RA uses a new mask M' obtained 
by converting M into probe coordinates such that a probe value is set to 1 (e.g. shows 
a nulceosome portion) if the corresponding base pairs in M include at least a 1. The 
real nucleosomal (linker) regions RNR (RLR) are represented by M' as contiguous 
sequence of l's or 0's respectively, here a nucleosomal (linker) region CNR (CLR) has 
been classified correctly if there is a match of at least I = 0.7 x L contiguous l's (0's) 
between CNR (CLR) and the corresponding RNR (RLR) in M' where L is the length 
RNR (RLR). The value 0.7 has been chosen because it represents a 70% of regions 
overlap very unlikely to be due to chance. 

3.6.1 MLA vs HMM on Synthetic Nucleosome Positioning data 

For MLA, we have chosen by the calibration phase K = 20 and m = 5, the value 
of a in Eq. 3.30 has been set to 0.5 to equally balance the two component of the 
dissimilarity. In particular, 6 signals of length ranging from 2337 probes (70130 base 
pairs) to 2361 probes (70850 base pairs) have been generated for the signal to noise ratio 
values 1, 2, 4, 6, 8, 10. In Fig. 3. 23 the results of the total RA for all the experiments are 
reported. The confusion matrices of HMM and MLA for all the experiments are 
reported in the tables 3.1 and 3.2. In Fig. 3. 23 the results of the total RA for all the 
experiments are summarized. Fig. 3. 23 shows that the HMM is slightly more accurate in 
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finding the bounds of the nucleosome regions. The synthetic results can be summarized 
in an overall RA of 0.96 for the MLA and 0.98 for HMM. 
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Table 3.1: Confusion matrices of HMM on 6 different signal to noise ratios for nucle- 
osome (N) and linker (L) regions. 
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Table 3.2: Confusion matrices of MLA on 6 different signal to noise ratios for nucleosome 
(N) and linker (L) regions. 



3.6.2 MLA vs HMM on real data 

In this experiment, it has been compared the accordance of the two models on the 
Saccharomyces cerevisiae real data. The input signal representing this data is composed 
by 215 contiguous fragments for a total of 24167 base pairs. In such experiment, K = 40, 
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m = 6 were chosen by the calibration phase (m = 0.15 x 40) and a = 0.5 was considered 
to equally balance the two components of the dissimilarity (see the definition in Eq. 
3.30). The confusion matrices which show the RA of HMM considering MLA as the 
truth classification and RA of MLA considering HMM as the truth classification are 
reported in table 3.3. The results can be summarized in an overall RA of 0.83 for the 
HMM (MLA true) and 0.69 for MLA {HMM true). In particular, from this studies 
it is possible to conclude that MLA does not fully agree with HMM on the linkers 
patterns. Remarkably, comparing MLA and HMM on the data coming from recently 
developed deep sequencing approach {DS) [2] it is possible to see a better agreement 
with MLA (0.58) rather than with HMM (0.44) (table 3.4 and figure 3.24). These 
analysis indicate that the integration of the HMM and MLA could improve the overall 
classification. 
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Table 3.3: Agreement between the HMM and MLA (and viceversa) on the Saccha- 
romyces cerevisiae data set for Nucleosomes (N) and Linker (L) regions. The table on 
the left shows the RA results of HMM when considering MLA as the truth classifica- 
tion, while the opposite is shown on the right table. 
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Table 3.4: Confusion matrices of MLA and HMM on deep sequencing approach (DS) 
data by Pugh et Al. (2007). 



3.6.3 Scalability and computational time of MLA and HMM: 

This point is fundamental because the size of a problem can vary significantly into 
this application domain, and if our method is not able to scale well it could become 
totally useless. The computation time of MLA and HMM have been compared on 10 
experiments. In particular, 10 synthetic signals have been generated, each one with a 
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fixed number of well-positioned nucleosomes ranging from 10 to 100 by step of 10. In 
figure 3.25, the ratios between the execution time of ML A (T m ) and HMM (T/j) for each 
experiment is shown. From this study, it results that, on average, T^ = 1.7 x 10 4 x T m . 

3.7 One- Class Classifier and ML A 

One of the key point of the MLA methodology applied on the case of nucleosome po- 
sitioning, is the classification phase that is applied after the discovery phase. In this 
section a new classification schema that take advantage of MLA will be presented. As 
explained in chapter 1 classification algorithms bases the construction of their discrimi- 
nating function on a training set that contains several examples for each class (or in the 
particular case of binary classification this means that are necessary both positive and 
negative examples). However, in many cases either only examples of a single class are 
available or the classes are very much unbalanced. To address this particular problem 
one-class classifiers have been introduced in order to discriminate a target class from the 
rest of the feature space [80] . The approach is based on finding the smallest volume hy- 
persphere (in the feature space) that encloses most of the training data. This approach 
is mandatory when only examples of the target class are available or the cardinality of 
the target class is much greater than the other one so that too few training examples 
of the smallest class are available in order to properly train a classifier. It is important 
to pinpoint that the nucleosome positioning data considered, involve necessary the use 
of a one-class scheme, since a training set of only well-positioned nucleosome is avail- 
able. This section present, a one-class classifier schema, in particular a one-class KNN 
(OC — KNN) in order to distinguish between nucleosome and linkers. The performance 
of the one-class KNN embedded in the MLA analysis, has been tested on the same 
kind of data previously described. Results have shown, in both cases, a good recognition 
rate. 

3.7.1 One-Class classifiers 

The first algorithms for one-class classification were based on neural networks, such as 
those of Moya et al. [58, 57] and Japowicz et al. [38]. More recently, one-class versions 
of the support vector machine have been proposed by Scholkopf et al. [68]. The aim 
is to find a binary function that takes the value +1 in a small region capturing most 
of the data, and -1 elsewhere. Data transformations are applied such that the origin 
represents outliers, then the maximum margin, separating hyperplane between the data 
and the origin, is searched. 

The application of machine learning to classification problems, that depends only on 
positive examples, is gaining attention in the computational biology community. This 
section lists some applications of one-class classifiers to biological and biomedical data. 
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In [89] a study using one-class machine learning for microRNA (miRNA) discovery 
is presented. Authors compare a One-class KNN to two-class approaches using naive 
Bayes and Support Vector Machines. Using the EBV genome as an external validation 
of the method they found one-class machine learning to work as well as or better than 
a two-class approach in identifying true miRNAs as well as predicting new miRNAs. 
In [59] a general method for predicting protein-protein interactions is presented. The 
search of feasible interactions is carried out by a learning system based on experimen- 
tally validated protein-protein interactions in the human gastric bacterium Helicobacter 
pylori. The author shows that the linear combination of discriminant classifier provides 
a low error rate. 

In [62] a one-class classification problem is applied to the detection of diseased mucosa 
in oral cavity. Authors either combine several measures of dissimilarity of an element 
from a set of target examples in a single one-class classifier or combine several one- 
class classifiers trained with a given measure of dissimilarity. Results show that both 
approaches achieve a significant improvement in performance. 

3.7.2 One-Class KNN 

Here, the one-class classifier named One-class KNN will be described. A KNN classi- 
fier for an M classes problem is based on a training set T for each class m, 1 < m < M. 
The assignment rule for an unclassified element x E X is: 

j = argmax \ T^J 1 (x) | (3.39) 

l<m<M 

where, Ti- ' (x) are the training elements of class m in the K nearest neighbors of x. 

One of the crucial points of the KNN is the choice of the best K, which is usually 
obtained minimizing the misclassification rate in validation data. 

In the case of a binary classification (M = 2), one-class training means that in the 
decision rule can be used examples of only one-class. Here, a one-class training KNN 
(OC — KNN) is proposed and which is a generalization of the classical KNN classifier 
[37]. Let T p be the training set for a generic pattern p representing a positive instance, 
and 5 a dissimilarity function between patterns. Then the membership for an unknown 
pattern x is: 

(l if \{y&T p such that 5(y,x) < </>}\ >K 
X<P,k(x) = < (3.40) 

I U otherwise 

Informally, the rule says that if there are at least K patterns in T p dissimilar from x at 
most <fi, then x is supposed to be a positive pattern, otherwise it is negative. 
It can be simply proved that the OC — KNN has some interesting properties: 
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Proposition 3.7.1 Let D a dataset of patterns, T p C D the training set for the 
positives, S^k = {x € D\x<j>,k( x ) = 1} the set with membership x<f>,K, then: 

a) S^k* C S^ K VA' > K 

b) S^K C Sp t x V(/> < $ 

The one-class KNN performances depends on the threshold, (f>, and the number of 
neighbors, K, that are used in the classification phase. Both of them can be determined 
by using a validation procedure applied on the training set of positives Tp. In the 
following, it will be described the procedure used to estimate the best pair ((p*,K*). 
Let us define the performance function M: 

Note that, in this validation procedure Vx £ T p assigned to S^k use the membership 
X<P,k(x) defined on the training set T p — {x}. By using M it is possible to define the 
functions P and Q 

P{<f>)= Yl M((j),k) and Q(k)= ^ M(<f>,k) (3.42) 

k£{K m ,K M } 4>£{4>m,<t>M} 

where {<j) m , 4>m} and {K m ,KM} are sets of increasing values of thresholds and number of 
neighbors respectively. By applying the proposition 3.7.1, it results that the function M 
increases while the threshold <j> increases, and decreases while the neighbors K increases. 
In figure 3.26(a) a 3d plot of the function M relative to the classification of nucleosome 
and linker regions on the Saccharomyces cerevisiae data set is shown. Assigning the 
values, cj) m = min X) y eTp 5(x,y) and (f> M = max X:y£Tp 5(x,y), K m = 1, K M = \T P \, the 
pair ((f)*, K*) to choose is: 

cf>* = min{4> | P(<f>) = max{P((f>)}} (3.43) 

K* = max{K | Q(K) / 0} (3.44) 

Informally, such estimation methodology selects the smallest threshold (ff which 
causes the best performances on the validation data, most independently from the values 
of K. Moreover, the value K* is chosen to be the largest one causing performances 
different from zero. In this way it is possible to obtain a good compromise between 
the generalization ability of the classifier and its precision, in fact the best value of (f) 
takes in account of several values of K and the value of K chosen should guarantee 
a good generalization ability. In figure 3.26(b) an image representation of M shows 
also the chosen (<f>* ,K*) concerning the classification of nulceosome and linker regions 
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on the Saccharomyces cerevisiae data set. A fuzzy extension version of the OC — 
KNN, has been recently tested on two public data-sets [22], studying also the gain 
in classification performances when combining several one-class classifiers defined by 
different dissimilarity functions. 

3.7.3 Results on synthetic data 

Also in this case, the performances have been evaluated in terms of Recognition Ac- 
curacy, RA (see section 3.6 for details). The synthetic experiments allows to test the 
robustness of the OC — KNN to signal noise. All parameters used in the generation 
of synthetic data have been inspired by biological considerations and are nn = 200, 
nl = 250, A = 200, r = 50, o = 20, nr = 100, dp = 0, dr = 0, pur = 0.8, nsv = 0.01, 
SNR = {1, 2, 4, 6, 8, 10} and ra = 4, resulting in 6 synthetic signals at different SNR. 
The training set Tp is represented by all WPN's that fit better the conditions in Eq. 
3.26 with os = 4, because, by biological consideration, it is known that a nucleosome 
is around 150 base pairs which corresponds to 8 probes. Thus, the training set T p 
and consequently its size TL, are automatically selected by the MLA depending on 
the generated input signal, resulting that, for the specific experiments reported here, 
TL = {63,98,127,142,145,147} for SNR = {1,2,4,6,8,10} respectively. The opti- 
mal parameters for the MLA are derived by a calibration phase described in [16] and 
have resulted H = 20 and m = 5. Here and in the next section H represents the 
number of threshold operations of MLA analysis in order to avoid ambiguities with 
the K of OC-KNN that represents the number of neighbors. The performances have 
been evaluated measuring the correspondence between the classified WPN or LN re- 
gions and the ones imposed in the generated signal. The parameters ((f)* , K*) of the 
OC — KNN has been chosen by the validation procedure described in section 3.7.2 for 
each SNR = {1,2,4,6,8, 10}. Figure 3.27 reports the best Accuracy and FPR values 
versus SNR, showing also, for each SNR signal, the ((f)* , K*) causing such values. 
From this study, it results that the average accuracy and FPR over the 6 experiments 
is 94% and 9% respectively. 

3.7.4 Results on real data: 

In this experiment, it has been again compared the accordance of the Hidden Markov 
model (HMM) for nucleosome positioning on the Saccharomyces cerevisiae real data. 
The training set Tp has been decided in the same way as above. In such experiment, 
H = 40, m = 6 were chosen by a calibration phase (m = 0.15x40) that is fully described 
in [16]. The confusion matrices, which show the RA of HMM considering MLA as the 
truth classification and RA of MLA considering HMM as the truth classification, are 
reported in table 3.5. The results can be summarized in an overall RA of (0.76) for the 
HMM (MLA true) and 0.65 for MLA (HMM true). 
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Table 3.5: Agreement between the HMM and MLA (and viceversa) on the Saccha- 
romyces cerevisiae data set for Nucleosomes (N) and Linker (L) regions. The table on 
the left shows the RA results of HMM when considering MLA as the truth classifica- 
tion, while the opposite is shown on the right table 



In particular, from this studies it is possible to conclude that MLA does not fully 
agree with HMM on the nucleosome patterns as in the previous case, in addition seems 
comparing the tables 3.5 and 3.3, that this classifier doesn't introduce any significant 
improvement than the one used in Section 3.5.8. 
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Figure 3.17: (a) Input signal, smoothing, pattern identification and extraction: A 
Saccharomyces cerevisiae microarray data portion. Each x value represents a spot 
(probe) on the microarray and the corresponding y value is the logarithmic ratio of 
its Green and Red values. Nucleosomes regions are around the peaks signal (one is 
marked by black circle), while lower ratio values show linker regions (marked by dashed 
circles). The dashed lines represents the threshold levels, in this example 6 patterns are 
retrieved, identified by rhombus, circle, square, triangle down, triangle up, star. Each 
pattern identifier is replicated for each of its feature values and pointed in each one of 
its middle point, (b ) An example of classification: In this portion 5 nucleosome regions 
are shown together with its range in base pairs. In particular 1 out of the 5 regions is 
classified as delocalized while the remaining well-positioned. 
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Figure 3.18: Shapes of the patterns: The three classes of nucleosomes it is possible to 
detect with the MLA very likely reflect different nucleosome mobility existing in vivo at 
specific chromatin loci. Delocalized nucleosomes probably represent single nucleosomes 
or arrays of nucleosomes with high mobility, while fused nucleosomes may reflect a single 
nucleosome that occupies two distinct close positions in different cells. On the left of 
the arrows, the particular nucleosome configuration which generates the resulting shape 
of well-positioned (W), delocalized (D) and fused (F) nucleosome classes are shown. 
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Figure 3.19: Classification: The classification of a generic pattern Pi is performed into 
two phases. In the first phase the linker (L), the expected well-positioned (EW) and 
the expected delocalized (ED) patterns are established by using the classification rule 
defined by c\. In the second phase, the expected regions A4 are defined by opportunely 
processing EW and ED patterns, and afterwards used by the classification rule c<i in 
order to finally classify between well-positioned (W), delocalized (D) and fused (F) 
nucleosomes. 
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Figure 3.20: Calibration phase for the choice of m: Recognition performance plots 
(group a) and percentage of minimum number of permanences plots (group b) for 3 
different signal to noise ratios, SNR = 1,2,4 (first, second, third column respectively). 
The bar in each plot groups the results for 10 experiments occurring at several threshold 
values (i.e number of cuts). 
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Figure 3.21: Calibration phase for the choice of K: The value for K is selected inter- 
actively by looking both at the plots of ~g and MS 
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Figure 3.22: An example of synthetic signal generation. 
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Figure 3.23: Results on synthetic data: The Recognition Accuracy of MLA and HMM 
on 6 synthetic signals generated at signal to noise ratios 1,2,4,6,8, 10. 
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Figure 3.24: A representative sample windows spanning 13 nuclesome where the agree- 
ment (disagreement) of the three methods is shown. The red draw represents the 
classification done by Pugh et Al. (2007) in [2] 



3.7. One-Class Classifier and MLA 



79 



x -]g 4 Comparison of MLM and HMM computation time 



fc; 




10 20 30 40 50 BO 70 00 90 100 
Number of well positioned nucleosomes 



Figure 3.25: Computation time performances: The execution time ratio T^/Tm of the 
MLA (T m ) and HMM (Th) for 10 synthetic signal generated with different number of 
well-positioned nucleosomes. The dashed line shows the average execution time. 
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Figure 3.26: Two different representations of M, on the left (a) a 3d plot, on the right 
(b) an image representation showing the values of M using grayscale (0 is black, 1 is 
white). In this latter figure, there are also the chosen pair (</>*,.&**) 
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Figure 3.27: Best Accuracy and FPR values versus SNR. The couples (</>, K) causing 
such results are also reported. 



Chapter 4 

Test of Randomness by MLA 



This chapter presents a new nonparametric test of randomness of a set of one- 
dimensional signals that take advantage of MLA preprocessing step. In particular, 
this procedure is based on the probability density function of the symmetrized Kull- 
back Leibler distance, estimated via a Monte Carlo simulation on the intervals lengths 
obtained by MLA. The main advantage of this new approach is that it allows to perform 
an exploratory analysis in order to verify directly the presence of several structures in 
an input signal. In particular this test differs from the other approaches because it 
exploits shape features that are rare in a random signal. 



4.1 Test of Randomness 

Given a signal or a sequence of symbols, it is first necessary to define the meaning of 
"random". In fact the term randomness has several meanings as used in several different 
fields. A good literature survey about randomness tests can be found here [67]. In the 
statistic literature, the concept of randomness is somewhat related to a sequence of 
random variables. The non randomness could be suggested by any tendency of the 
observation to exhibit regularities in the sequence of observations. For example, if an 
observation in a sequence is influenced by the previous observations or, more in general, 
if the observed value in a sequence is influenced by its position, the process is not 
truly random. More formally, a generic sequence is said random in statistical context 
if the process that has generated it, produces independent and identically distributed 
observations or i.i.d.. In some context, it is typical that the observations are not truly 
random in rigorous statistical sense i.e. i.i.d, but although the sequence are not formally 
random, it could be of interest to measure, fixed a certain degree of confidence, how 
close to random it is. The application of these approaches are manifold: for example a 
test of randomness can be useful in the case of exploratory analysis in order to verify 
the possible presence of structures in an input signal; in the context of cryptography to 
assess the performance of a good pseudo-random generator (because it is a fundamental 
building block in a lot of algorithms) or can be used to test the strength of a password 
[35, 30]. 
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4.1.1 State of the art 

This section does not pretend to be a detailed revision of all the methodologies known in 
literature; the main ideas and their references will be presented instead. In particular 
in statistic literature, there are several approaches to test if a sequence is random, 
exploiting the "non randomness" in different ways: 

• test based on runs 

• test based on entropy estimator 

• test based on ranking 

• test based on goodness of fitting to a given distribution 

It will be shown that the test of randomness that uses the MLA as preprocessing 
step belongs to the last class. 

4.1.2 Test based on runs 

These tests are based all on the central concept of run given in the following definition: 

Definition Given an ordered sequence of one or more symbols, a run is defined to be a 
succession of one or more type of symbols which are followed and preceded by different 
symbols or no symbol at all. 

Once the runs in the signal are identified, the measure of randomness could depend 
on their number, lengths or both. That's why in a real random sequence is very unusual 
to have too few or too many runs or runs of considerable length. So these information 
can be used as statistical criteria to assess if a signal is truly random. Common ap- 
proaches to define runs starting from a signal are to dichotomize it (e.g. considering its 
sign for each observation), comparing the amplitude of consecutive points within respect 
to a focal point (e.g. its mean or its median) or looking for trends. More information 
about these approaches can be found here [30]. 

4.1.3 Test based on entropy estimator 

These tests are based on the entropy of a signal or related features. In general the 
entropy is a measure of the uncertainty associated with a random variable [18]: 

Definition Let X a discrete random variable with alphabet E and probability mass 
function p(x) = Pr{X = x}, x 6 S. The entropy H(X) of a discrete random variable 
X is defined by: 

H(X) = H(p) = -J2 p(x)log 2 p(x) (4.1) 
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For example if we consider the sign test [30] (a particular run test) or a binary 
vector, it should be expected that the sequence of signs (or bits) are i.i.d. and this 
obviously follows from the fact that the positive and negative signs are equiprobable 
i.e. P(s(i) > 0) = P(s(i) < 0). If this assumption is not true, it is easy to prove that 
the entropy will be strictly less than 1. In general these tests use this null hypothesis: 

H : H{p) = 1 (4.2) 

Usually, given a signal / these tests start approximating the probability distribution 
for / and then calculating its entropy. Further details can be found in [28] and [85] . 

4.1.4 Test based on ranking: Wilcoxon rank sum test 

These tests are based on the concept of ranking, where for ranking is meant a sorting 
of the observation in non-crescent or non-descdendent order. A very popular test that 
falls in this category and that can be used to evaluate the randomness of a signal is the 
Wilcoxon rank sum test. Given two vectors of observations X and Y also of different 
lengths, test the null hypothesis that data in the vectors are independent samples from 
identical continuous distributions with equal medians, against the alternative that they 
do not have equal medians [35]. More formally: 

Given N = m + n observations X\, . . . , X m and Y\, . . . , Y n , the assumed model is: 

Xi = a i = l,...,m (4.3) 

Yj = e m+j + A j = l,...,n (4.4) 

where e m _|_i, . . . , e m + n are unobservable random variables, and A is the shift between 
the samples. Here we suppose that the iV observations are mutually independent and 
each e come from the same continuous population. 
The test consist in evaluating the null hypothesis: 

H : A = (4.5) 

The first step is to sort the N observations in increasing order and let Rj denote the 
rank of Yj in this ordering. Then the statistic W is calculated using this equation: 

W = Y,Rj (4-6) 

For a one side test of Hq versus the alternative Hi : A > 0, at a level of significance: 

reject Hq if W > w(a, m, n) 
accept Hq if W < w(a, m, n) 
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where the constant w(a,m,n) satisfies Po[W > w(a,m,n)] = a 

Let R^ 1 ' <,...,< Ry 1 ' the ordered Y ranks in the joint ranking of X and Y then the null 
distribution for W = XJ?=i Rj = Y^j=l ^ can De obtained considering that under the 
hypothesis Hq all possible ( ) assignments for [Ry\ . . . , RS n '\ have probability l/( n ) 
in this way it is possible to derive the null distribution without specifying the underling 
distributions of the e's. 

4.1.5 Test based on goodness of fit: Kolmogorov-Smirnov goodness 
of fit Test 

These tests start from a statistical model try to assess how well some observations fit 
the model. A very popular test that falls in this category and that can be used to eval- 
uate if two samples are drawn from the same distribution is the Kolmogorov-Smirnov 
goodness of fit Test [73]. This distribution free test is used to check if one sample comes 
from a particular distribution or if two samples come from the same distribution. This 
test is based on the comparison between the empirical cumulative distribution function 
and the theoretical cumulative distribution function. More formally: 

Let X a random variable with cumulative function F(x), given another cumulative 
function Fn(x) this test check the hypothesis: 

H :F(x) = F N (x),Vx (4.7) 

Let D the max absolute value of the difference between the two cumulative distribution, 



i.e. 

D= sup \F N (x)-F(x)\ (4.8) 

— oo<x<+oo 

where F(x) is the theoretical cumulative function and Fjy(x) is the cumulative distri- 
bution observed. Let x\,X2, ...,xn a random sample, Fn(X) is obtained as: 

if x < x\, 

F N {x) =\ | iix k <x<x k+l (4.9) 

1 if x > xn- 

Fn{x) is a good estimator of F(x), in fact it can be proven that Fn(x) = F{x). 

n—>oo 

At this point considering the observed value of D, and considering the theoretical dis- 
tribution of D, once fixed a confidence level a it's possible to calculate D a , then choose 
to reject or not the hypothesis Hq using the condition: 



reject Hq if D > D a 
accept Hq if D < D a 
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4.2 MLA Test of Randomness 

As it was shown in the previous Chapters, the MLA is strongly related to the class 
of methods successfully used in the analysis of very noisy data which, by using several 
views of the input data-set are especially able to recover statistical properties of a signal. 
Here a test of randomness, based on the distance of the interval lengths p.d.f 's detected 
by the Multi-Layer Analysis (MLA) will be presented. Such p.d.f's are estimated for 
each cut-set and the hypothesis test is performed against random signals generated via 
a Monte Carlo simulation. At this end the symmetrized Kullback-Leibler measure has 
been used to estimate the distribution distances. 

4.2.1 Monte Carlo simulation 

The Monte Carlo methods [54] are a class of computational algorithms that perform 
their computation using a random process to simulate or sample the possible space of 
solutions. They are used in the case when a deterministic approach are inapplicable for 
example due to the complexity of the problem. A typical scenario is the use of these 
methods to randomly sample a large number of states of a complex system so to use 
those states to model the behavior of the the whole system. The Montecarlo Method 
is used in several different contexts, but shares the same general approach depicted in 
Figure 4.1. In the MLA test of randomness, a Montecarlo Method is used to model the 
random case in term of Kullback Leibler distance applied on the interval representation 
obtained by MLA on random signals, and will be shown in the following sections: 



Monte Carlo Method(P) 

begin 

1. Define the space of inputs or solutions S 

2. Random sampling from S using a particular probability distribution P 

3. Use the sample of the previous step to perform a deterministic computation 

4. Aggregate the results of the previous step to produce the final result R 
end 

return (R) 



Figure 4.1: The general schema of MONTECARLO METHOD. 



86 Chapter 4. Test of Randomness by MLA 

4.2.2 Hypothesis test 

In order to detect the presence of structures in the signal an hypothesis test based 
on the expected probability distribution function (p.d.f.) of the segments length is 
proposed. The null hypothesis (Hq) represents a random signal and it is accepted if the 
p.d.f. of the segment lengths, p±, is compatible with a random signal distribution, pq; 
the hypothesis Hi represents a structured signal and it is accepted if the p.d.f. of the 
segment lengths is not compatible with a random signal, pq. It follows that we need 
to measure the similarity (dissimilarity) of two p.d.f. 's and set a confidence level a to 
perform the decision. 

The symmetric Kullback-Leibler measure, SKL, has been considered to evaluate 
the dissimilarity of the two distributions po, and p\ [42]: 

SKL(p ,pi) = 

where, KL is the no-symmetric Kullback-Leibler measure. In the continuous case, 
p.d.f. 's are defined in a dominion JC1 and the KL measure is defined as: 



f D\ X ) 

KL{p,q) = I p(x)log——dx 

Ji q(x) 



In the discrete case ICN and the KL become: 

Pi 



KL(p,q) = ^2 Pi log 



In order to perform the hypothesis test it is necessary to know the p.d.f. of the SKL in 
the case of a random signal. The derivation of analytical form of this p.d.f. is usually 
an hard problem that has been solved by a Monte Carlo simulation. For example in 
[36] a goodness-of-fit test for normality is introduced; it is based on Kullback-Leibler 
information and a Monte Carlo simulation is performed to derive and estimate the 
p.d.f's. In [73] an extension of the previous test is described for s-normal, exponential, 
and uniform distributions and also in this work a Monte Carlo simulation has been used 
to estimate the p.d.f. of the measure KL. 

4.2.3 Probability density functions estimation 

In this section, the simulation performed to estimate the p.d.f. 's of both the intervals 
length, IL]. (PIL^), and the SKL^ (PSKLk), at a given threshold tk will be outlined. 
Here, SKLk is the distance between the p.d.f. 's of two interval length. 

To estimate the p.d.f. of ILk, RS n , n = 1,...,N signals of length / have been 
generated, according to a normal distribution with fl and a estimated from an input 

(n) 

signal S of length L. Each signal, RS n , is then used to evaluate experimentally PIL\ ' 

(n = l,2,...,iV). 
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In the simulation, for each threshold, tk, it is then possible to derive the experimental 
distributions of the IL k in R^. Therefore k = 1, 2, ..., K normalized p.d.fs are obtained 
. PiXL with nb bins. Figures 4.2(a), 4.3(a), 4.4(a), show examples of PIL k for a 
simulation using I = 20000, L = 200000, N = 1000, K = 9, nb = 100. 

The estimation of the p.d.f. of SKL k and PSKL^, is carried out by computing 
the SKLk between the pairs (PILj^ n , PILj^ 1 '), with m ^ n. In this simulation it 

was drawn the evaluation of PSKL^ from a sample of — ^ ~ - elements by using a 
density estimation with Gaussian kernel. Figures 4.2(b), 4.3(b), 4.4(b) shows examples 
of PSKL k for a simulation using I = 20000, L = 200000, N = 1000, K = 9,nb = 100. 



Histogram of intervals lengths at level:4 



0.07 



,0.06 



ro.05 



; 0.04 



■ 0.03 




Interval lengths 

(a) 



Distribution of Kullback distances in noise at level:4 



0.5- 

0.4- 



0.2- 
0.1 - 








— Density 
-••■CI 10% 
-°-CI5% 
-•-CI 1% 



-1 



3 4 5 6 7 

Kullback distance 



(b) 



Figure 4.2: Examples of PIL k (a), and PSKL k (b) for k = 4 
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Histogram of intervals lengths at level:5 
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Figure 4.3: Examples of PIL k (a), and PSKL k (b) for k = 5 



4.3 Experimental Setup 



Here, the evaluation of the test will be presented. In particular, the test has been car- 
ried out on simulated and real data respecting a particular tiled microarray approach 
able to reveal nucleosome positioning information on DNA [90] and presented in detail 
in chapter 3. Here unlike to the case study presented in chapter 3 (in which the prob- 
lem was to infer the nucleosome positions) the problem is to investigate if the shapes 
correspondent to nucleosome binding sequences have some specific features. Results 
indicate that such statistical test may indicate the presence of structures in real and 
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Histogram of intervals lengths at level:6 
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Distribution of Kullback distances in noise at level:6 
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Figure 4.4: Examples of PIL k (a), and PSKL k (b) for k = 6 

simulated biological signals, showing also its robustness to data noise and its superiority 
to the Wilcoxon rank sum test. In Fig.s 4. 5(a), 4. 5(b), 4. 5(c) three examples of input 
signals with signal to noise ratio SNR = 1,1.5,10, are given. This allows to control 
the accuracy of the proposed test of randomness and perform the calibration of the 
methodology. The same test has been applied to the data used in the simulation phase. 



4.3.1 Assessment on synthetic data 

The input signals used to evaluate the test, are synthetically generated following the 
procedure described in [23] and in Chapter 3 and represent signals which emulate the 
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nucleosome positioning data. 

In the following, an assessment of the proposed hypothesis test to guess the presence 
of structures in an unknown input signal is performed. In this sense these can be 
considered part of an exploratory data analysis procedure. This experiment has been 
carried out generating 40 synthetic test signals of length L = 200000 (base pairs) , with 
signal to noise ratio ranging from SNR = to SNR = 10 by steps of 0.25 and N 
random samples of length / = 20000 (base pairs). The simulation used to estimate 
the p.d.f of SKL, has been done using the synthetic signals of length L (base pairs), 
N = 1000 random samples of length I (base pairs), K = 9 thresholds and nb = 100 bins. 
The number of bins has been set as a good compromise among different sample size at 
different thresholds. For each test signal S its SKLk from a random sample drawn from 
the RS n samples is computed and used to verify the test of hypothesis on the PSKLk- 
In particular for each test signal S recalling that R n , as defined above, is of the same 
length of S and generated according to a normal distribution with /2 and a estimated 
from S. In figures 4.6, 4.7, 4.8 some results of the test are provided for increasing SNR, 
for confidence level a = 99%, 95%, 90% and at different thresholds. In the abscissa 
the SNR is represented, while in the ordinate the probability that the symmetrized 
Kullback-Leibler distance falls in the interval [0, SKLk]. If the ordinate value is greater 
than the confidence a the random test is rejected. From previous results, it can be seen 
that the test is not reliable for lower and the higher thresholds while it is quite sensitive 
for intermediated thresholds. For example, for tk = 5, 6, 7, 8 and a > 90% the random 
hypothesis is rejected for SNR > 3.0, 1.5, 1.25, 1.5 respectively. Intuitively, this can 
be explained because the number of intersections is low for higher and lower threshold 
values. 

4.3.2 Assessment on real data 

The test of randomness has been applied to real biological data derived from a tiled 
microarray approach able to reveal nucleosome positioning information on the Sac- 
charomyces cerevisiae DNA [90]. The input microarray data, S, are organized in T 
contiguous fragments Si,--- ,St which represents DNA sub-sequences. This dataset 
is explained in detail in Chapter 3. 

In the experiment we set K = 10, nb = 100, for each signal fragment Si the 
corresponding intervals INTik are extracted for each threshold tk- 

Finally, the set of intervals INTk = |J i=1 INTik are used to compute the interval 
distribution length PILk- Then, the SKLk from a random sample R drawn from the 
RS n samples is computed and used to verify the test of hypothesis on the PSKLk- 
Note that, in this experiment, the length of the real signal and of the random sample 
is 20000 base pairs. Figures 4.9(a), 4.10(a), 4.11(a) show PIL k for k = 4,5,6. 

The experiment indicates that the hypothesis test is rejected at confidence level 95% 
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for k = 5, while for k = 6, 7, 8, 9 is rejected at a confidence level > 99%. In figures 4.9(b), 
4.10(b), 4.11(b) are shown the result of the test of randomness for k = 4, 5, 6. Moreover, 
the test of randomness is quite unstable for k < 4 and k = 10; this property highlights 
that the central part of the signal contains the majority of the useful information for 
the test of randomness (see Figure 4.12). 

4.3.3 Comparison with Wilcoxon rank sum test 

In this section a comparison of the results of MLA test with the Wilcoxon rank sum 
test, both on synthetic and real data is presented. Both hypothesis tests, can be applied 
when no assumption about sample distribution can be made, condition which falls in 
this case. 

Firstly, it was verified if each synthetically generated S (a total amount of 40 signal) 
and N = 100 random samples drawn from the RS n are significantly different by using 
a Wilcoxon rank sum test. Figure 4.13(a) shows the results that can be summarized 
affirming that S and a generic random signal R are at least 90% significantly different 
starting from SNR = 1.25. This reveal that the Wilcoxon test and our test have quite 
the same predictive power when considering intermediate threshold levels of the MLA 
(k=6,7,8,9) . 

In the case of a real signal S, the Wilcoxon rank sum test has rejected the 
hypothesis of randomness on S only 3 times over N = 100 tests (see figure 4.13(b)). 
This makes the Wilcoxon rank sum test not reliable for such kind of data, while our 
test, as already shown in section 4.3.2, confirms his predictive power on intermediate 
thresholds. 



In this chapter several tests have been introduced in order to check the randomness 
of a set of one dimensional signals and a new test of randomness based on the MLA 
preprocess has been also presented. It makes uses of the Symmetrized Kullback-Leibler 
distance, and it has been shown to be useful in the case of exploratory analysis in order 
to verify the possible presence of structures in an input signal. Finally, it is able to 
guess structures in the case of real and simulated data for nucleosome positioning with 
low SNR (1.5), while a simple Wilcoxon rank sum test has not shown enough reliability 
on the same kind of data. 
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Figure 4.5: Examples of input signals: (a) input signal SNR = 1; (b) input signal 
SNR = 1.5; (b) input signal SNR = 10. 
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Figure 4.6: Examples of hypothesis test at different SNR and thresholds. 
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Figure 4.7: Examples of hypothesis test at different SNR and thresholds. 
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Figure 4.8: Examples of hypothesis test at different SNR and thresholds. 
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Distribution of interval lenghts at level:4 in real signal 
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Figure 4.9: PIL^ (a) and PSKLk and hypothesis test results (b) of the real signal for 
fc = 4. 
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Distribution of interval lenghts at level:5 in real signal 
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Figure 4.10: PIL^ (a) and PSKL^ and hypothesis test results (b) of the real signal for 
k = 5. 
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Distribution of interval lenghts at level:6 in real signal 
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Figure 4.11: PIL^ (a) and PSKL^ and hypothesis test results (b) of the real signal for 
k = 6. 
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Figure 4.12: The gray strep indicates the useful part of the input signal in order to 
perform the test of randomness. 



100 



Chapter 4. Test of Randomness by MLA 







0.9 



-*- Probability 


— 10% 


- 5% 


- - 1% 



0.7 



0.6 



0.5 



0.4 



0.3 



10 



15 20 25 30 

SNR 



35 40 



(a) 




Probability 

10% 

-• 5% 

- - 1% 



(b) 



Figure 4.13: Mann- Whitney rank sum test results for different signal to noise ratio (a) 
and for the real signal (b). 



Chapter 5 

MLA and Kernel methods 



This chapter presents how the MLA can help on designing new kernel functions that 
explicitly take into account the shape information contained in a one-dimensional sig- 
nal. In the following, the main idea of kernel methods will be presented, giving more 
details to a particular subclass of kernel functions applicable to structured data, and in 
particular trees. The MLA is used to define a mapping from the set of one- dimensional 
signal to the set of trees. For this reason the main advantage of defining a kernel func- 
tion based on MLA is that it is possible to incorporate shape information directly in a 
kernel function encoded as a tree. 

5.1 Kernel methods 

Kernel methods are a class of algorithms used in the context of pattern analysis. Al- 
though initially they were developed in the context of classification, with the well known 
Support Vector Machine (SVM) method first introduced by Vapnik [82], the kernel ap- 
proach has shown to be applicable to several key problems in data analysis (Principal 
Component Analysis, Clustering, Regression, Ranking, Correlation). In this sense, 
nowadays, it is usually referred to the kernel methods, as a general framework applica- 
ble to all kinds of data [81]. In fact recently, kernel methods were developed to deal with 
data without an explicit vector representation such as complex objects or structured 
data (string, tree, graph, etc.). 

5.1.1 Main ideas of kernel methods 

The main advantage of kernel methods came from their modularity: all these methods 
consist of two parts: a kernel function, and an algorithm to analyze the data after the 
kernel mapping, as shown in figure 5.1. In particular, the kernel embeds the input 
space into a new vector space where the algorithm used to analyze the data could have 
better performance than the same algorithm applied on the original input space (see 
figure 5.2). The kernel functions represent the spatial relation between pair of data 
elements, using an inner product in the new space without explicitly map such data. 
In this way it is also possible to use infinite dimensional space without encoding the 
data explicitly with new coordinate vectors. Moreover, in many case the computation 
of the inner product could be more efficient than explicitly map each point into the 
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new vector space and computing for example the pairwise distances. This imply that 
it is not necessary to know the exact coordinates of the points in the vector space but 
only their pairwise inner product. In other words the dimensionality of the new vector 
space does not affect the computation time. This propriety is usually called "kernel- 
trick", and can be summarized saying that, to perform data analysis with a kernel, it is 
not necessary to know explicitly the vector space where the data will be projected in. 
An important point of the kernel functions is that the mapping could catch non-linear 
relation present in the data linear in the new space. This permits to take advantage 
of the large class of well understood methodologies that search linear relation in the 
data. In this way the choice of a particular kernel function is related to the vector space 
where the data points will be implicity projected. A deeper coverage of the theory and 
application of kernel methods can be found in the book by Taylor and Cristianini [75] . 
Now, it will be given the formal definition of kernels and some of their properties will 
be count. 
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Figure 5.1: General schema of kernel methods 





Figure 5.2: Kernel mapping 
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5.1.2 Formal definition and properties of kernels 

Definition Kernel Function 

Given a set I/O, and a mapping function from X to a features vector space F i.e. 

4>(x) : X — > F a kernel is a function K : X x X — > R that for all x,y £ X is: 

k(x,y) = {ct>(x),ct>(y)) (5.1) 

where (•, •) denotes the euclidean inner product on F 
It is clear that the function K is symmetric i.e.: 

k(x, y) = (0(x), Hv)) = (<Kv), fa)) = Hv, x) (5.2) 

An important theorem that provide a characterization of the class of kernel function 
is the here stated Mercer's theorem[19]: 

Theorem 5.1.1 (Mercer's Theorem) Let X a compact subset ofM. n . Suppose K is a 
continue symmetric function such that the integral operator Tk '■ £2 (a?) — > -^2(^)1 



(T K f)(-)= / K(-,x)f(x)dx (5.3) 

Jx 

is positive, that is: 

[ K(x,z)f(x)f(z)dxdz>0 (5.4) 

JXxX 
for all f £ L-z(x). Then it is possible to expand K(x, z) in a uniformly convergent series 
(on X x X) in terms ofT^ 's eig en- function cfij € L/2(X), normalized in such a way that 
||<^j||l 2 > and positive associated eigenvalues Xj > 0. 

00 
K(x,z) = ^2\ j 4> j (x)(l) j (z) (5.5) 

i=i 

A special case of this theorem is the following, that characterizes the Kernel function 
on Finite spaces. 

Theorem 5.1.2 Let X a finite input space with K(x,z) a symmetric function on X. 
Then K(x, z) is a kernel function if and only if the matrix: 

K = (K(x i ,x j ))Z J=1 (5.6) 

is positive semi- definite (has non negative eigenvalue) i.e: 

n n 

}, ^J CiCjK(xi,Xj) > (5.7) 

i=i j=i 

with n > 0, x±, . . . ,x n £ X and Ci, Cj £ R. 
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Proof Since the matrix (K(xi,Xj))^, =1 is symmetric, there exists an orthogonal matrix 
V such that: K = VAV where A is the diagonal matrix containing the eigenvalues 
At of K, and the columns of V are the corresponding eigenvectors vt = (^tj)iLi- By 
hypothesis, the eigenvalues of K are non-negative, so it is possible to define the mapping 



<f) : x t ^ (VWLi (5.8) 

And express the inner product as: 

(4>(xi), <f>( Xj )) = J2 A <^% = ( VAV % = K(x, y) (5.9) 

i=l 

And this proves that A' is a kernel function that calculate the inner product in 
the vector space given by the mapping function </>. Note that the condition of positive 
semi-definiteness is necessary, since if it exists at least a negative eigenvalue X s with 
corresponding eigenvector v s , the point: 

n 

z = Y, v«<Kxi) = y/AVvs (5.10) 

i=l 

would have a norm squared less than in that space that is impossible: 

|| z || 2 = ( z , z ) = v'sVvWAVVs = v' s VAV'v s = v' s Kv s = X s < (5.11) 

5.1.3 Kernels and distances 

A simple property of the inner product, is that it naturally induces a norm: 



\\x\\ 2 = y/{x,x) (5.12) 

and thus a metric or distance: 

d(x,z) = \\x — z\\2 (5.13) 

It follows immediately, that a generic kernel function also induces a distance: 

Definition Distance induced by a kernel function 

Given a kernel function k, and consider the Gram's matrix Gij = k(xi,Xj) = 
{(p(xi),(j)(xj)), it is possible to obtain a pairwise distance matrix Dij from G using 
the following relation: 



Dij 



J\\ 4>(xi) - 4>(xj) || 2 = Jk(xi,Xi) + k(xj,Xj) - 2k(xi,Xj) (5-14) 

As an example, let us consider the euclidean distance: 
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Definition Euclidean Distance 

Given two signals x and y their Euclidean Distance is defined as: 



d n {x,y) 



. X>i-I/i) 2 (5.15) 



where x = (xi,...,x m ), y = (yi,...,y m ). 

It is straightforward that the euclidean distance is induced by the linear kernel 
K(x,y) = xy'. 

5.2 Kernel methods for tree 

All the classes of kernel function in this category are based on the concept of tree i.e. 
the input data are represented in a tree structure. One assumes that the reader is 
familiar with the general concepts of graph theory, in particular with the definition 
of tree structure. For an appropriate background, the reader is referred to standard 
literature [7]. As stressed in the introduction, it is possible to define kernel function 
even when the input data doesn't have an explicit vector representation. This is the 
case of structured data and in particular in the case of tree structure. More in general, 
there exists a class of kernel function called Convolution Kernel and firstly introduced 
by Hausler [34] and later extended by Shin and Kuboyama [76] [77] that decompose a 
data object into simpler parts and then define a kernel function in terms of such parts. 

5.2.1 Convolution kernel 

This class of kernels are particular devoted for problem involving the processing of 
structured data like string, trees, graph. In fact it provides a way to extract real-valued 
features and thus to map these data into a vector space M. (finite case) or in the Hilbert 
space of all square summable sequences (infinite case) . The main idea of this approach 
is that in some case, it is easier to compare two objects in terms of their simpler parts 
or features. As the other kernels, it is not necessary to explicit map an input data in 
the feature space, the only requirement is the calculation of the inner product between 
two input data in the feature space. The name convolution came from the fact that the 
value of the kernel is obtained from a sum of products of other kernels, similar to the 
idea of convolution between function. 

Definition Convolution Kernel 

Let x E X a structured data, X\ , . . . Xr> non-empty separable metric spaces and x = 
{x\, . . . , xd) the subparts of x (for example in a string a subpart could be a substring) 
with each Xd £ Xd with 1 < d < D. Consider the relation R : X\ x . . . x Xd x X where 
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R(~x,x) is true if and only if x\, . . . ,xd are the subparts of x. Let i? _1 (x) = { x : 
R(x, x)} and R is said finite if i? _1 (x) is finite for all x £ X. Given two element 1,1/6 
X their decomposition x = {x\, . . . , Xn), y = (j/i, ■ • ■ ; Vd) in A"i, . . . Xz>, suppose that 
for each X d with 1 < d < D exists a kernel K d , then the Convolution Kernel is defined 

as: 

D 

K(x,y)= Yl ~[[K d (x d ,y d ) (5.16) 

The proof that K is a valid kernel can be found in the original paper [34] . 

5.2.2 Tree kernels 

In the last years a variety of convolution kernel has been proposed for different kind 
of structured data, such as string, tree and graph [29], [31], [11]. Here, only the main 
idea on kernels for trees will be presented, the interested reader can found a good 
characterization of tree kernels in the phd thesis by Kuboyama [77]. Tree kernels [14] can 
be applied to ordered trees and they compute the similarity between trees considering 
their common subtrees. There are several kind of tree kernels but all of them share 
the same idea of decomposing, in the convolution kernel framework, a tree in different 
kind of subtree (for example simple subtree or co-rooted subtree). As an example, let 
us consider a particular convolution kernel: let x S X a rooted and ordered tree and 
X\, . . . Xo the set of all D-degree ordered and rooted trees. In this case the relation 
R defined before is: R(x,x) 44> x\, . . . ,xn are the D subtrees of the tree x. in the 
following, one tree kernel used in context of Natural Language Parsing that exploit this 
idea and that has inspired several works on tree kernel (and also the MLA tree kernel) 
will be defined. 

Definition Collins and Duffy Tree Kernel[14] 

Given a tree T, and considering the enumerable set of all possible trees T = 
{Ti,T2, . . . ,T n }, T can be represented by an n-dimensional vector where the z'th com- 
ponent contains the number of occurrences of the i'th tree Tj of T in T. This mapping 
is done considering the function hi(T) that count the number of occurrences of Tj in 
T. In this way it is possible to represent a tree T as h(T) = (/ii(T), ti2(T), . . . , h n (T)). 
Note that the number n could be huge because the number of subtree of a given tree T 
is exponential on its size. The kernel is then defined as: 

K(T U T 2 ) = h(7i) • h(T 2 ) =J2hi(Ti)hi(T 2 ) = (5.17) 

i 

= E E E I ^ nl ) / ^ n2 )= E E c ^ n *) ( 5 - 18 ) 
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where iVj is the number of node in T\, N 2 is the number of node in T 2 , Ii{n) is an 
indicator function defined as: 

1 if the subtree X; is seen rooted at node n 
h(n) = \ (5.19) 

I U otherwise 

and C(ni,n 2 ) = Y^ih{ni)Ii{n 2 ) 

This kernel can computed in polynomial time, expressing C(ni,n 2 ) with the follow- 
ing recursive definition: 

• if the productions at n\ and n 2 are different: C(ni,n 2 ) = 

• if the productions at n\ and n 2 are the same and n\ and n2 are pre-terminal 
nodes: C(ni,n 2 ) = 1 

• else if the productions at n\ and n 2 are the same and n\ and n 2 are not 
pre-terminal nodes: 

nc(ni) 

C(rii,n 2 )= IJ (l + C^ni,^,^^,^)) (5.20) 

i=i 

where nc(ni) is the number of children of n\ in the tree (note that nc(ni) = nc{n2) 
because the productions are the same) and ch(nk,i) is the i'th. son of node nk in 
a tree. 

In the original paper some variant of this kernel is proposed to take into account 
some issues: 

• The value of kernels K(Ti,T2) depends strongly on the size of the trees T\ and 
Ti- A possible solution is to use a new normalized kernel defined as: 

K'(T 1 ,T 2 )= - K{ ? 1 ^ == (5.21) 

yjK{T u T{)K{T 2 ,T 2 ) 

Note that K is still a kernel function because still satisfies the theorem 5.1.2. 

• Since the number of subtree increases with size or depth, it is necessary to scale 
the importance of each subtree taking in account their sizes: 

nc(ni) 

C(ni,n 2 ) = A and C(m,n 2 ) = A [ (1+C (ch(ni, j), ch(n 2 , j))) with < A < 1 

(5.22) 
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This correspond to the kernel: 

K(T 1 ,T 2 ) = Y,^ size 'h i (T 1 )h l (T 2 ) 



(5.23) 



In order to obtain this result the parameter < A < 1 was introduced. In this 
way the kernel downweight the contributions of tree fragments exponentially with 
their size. 



5.3 MLA Kernels 

5.3.1 MLA Tree Kernel 

The MLA Tree Kernel is based on the MLA and in particular it is obtained using 
(1) the MLA on an input signal, (2) a particular aggregation rule that produce a tree 
from intervals and (3) a modified tree kernel adapted to the nature of the class of trees 
produced by the first two steps. A schematic view of the MLA Tree Kernel inserted on 
the whole process of Kernel Methods is depicted in figure 5.3. 
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Figure 5.3: General Schema of MLA Tree Kernel 
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5.3.1.1 From signal to tree 

Definition MLA tree aggregation rule 

Given a signal / denned in [a, b] and K threshold operations a^ (k = 1, ..., K) after the 

application of Equally spaced simple MLA where the condition on each sigma is: 

I <fi otherwise 

it is possible to obtain the interval representation T(/) of /, recalling that T(/) = 
{I\,l2, • • • ,Ik} with Ik = \i\, i\, ■ ■ • , %l k } the set of intervals corresponding to a^. To 
obtain a tree from the signal / it is necessary to use its interval representation T(/) 
using a particular aggregation rule on intervals. It is necessary first to introduce a 
relation R : If. x I^+i with I}, and Ik+\ £ T(/). Given two intervals i| and i k+ x they 
are in relation and it will be indicated as R(i%iV<:+l) ^ anc ^ on ^Y if *l+i ^ i\- 
Now, let us define the undirected tree T = (V, E) such as: 

K 

V = I U (J Ii with I = {r= [a, b}} (5.24) 



and 

E = {(i u i 2 ) with»i,«2 G V:i2(ii,»2)}. (5.25) 

In this way it is possible to define a labeled and rooted tree T with root r and in which 
each node encode the correspondent interval. The depth of the tree is exactly K + 1 as 
it is necessary to add the node r that represents the interval [a, b] where / is defined. 
It is possible to see an illustrative picture of the process in figure 5.4 

5.3.1.2 Proposed Tree Kernel 

This kernel is defined starting from the tree T previously defined in 5.24 and 5.25. 
The idea behind this kernel is similar to the tree kernel proposed by Collins and Duffy 
introduced in section 5.2.2. In their original work they have used the tree kernel to 
characterize parse trees, here it is shown how adapt their approach to the set of tree 
obtained by MLA and representing the class of one-dimensional signals defined for 
some interval [a, b]. The main idea of this kernel is to compare two signals using their 
tree representation. In the original kernel of Collins and Duffy each node represent 
a production rule or a terminal symbol for some formal languages, here the nodes 
represent intervals. 
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Definition MLA Tree Kernel 

Using the same convention of tree kernel presented in 5.2.2, the MLA tree kernel is 

defined as: 

K(r 1 ,T 2 ) = h(r 1 )-h(r 2 )= Y, E c ^ n ^ 5 ) ( 5 - 26 ) 

niSJVi n 2 &N 2 

where n\ and ri2 for simplicity of expression represent also the interval lengths associated 
to the nodes n± and n 2 , 5GM with < 6 < (b— a), and C(n±, rt 2 , 8) recursively defined 

as: 

• if m is a leaf and n 2 is not a leaf or viceversa then C(n\, n 2 , 5) = 

• if |m — n 2 | > 6 and the intervals are pre-terminals (both fathers of a leaf) then 
C(m,n 2 ,<5) = (rti and re 2 are considered different). 

• if |ni — n 2 | < 5 and the interval n\ and n 2 are two leafs then C(m, n 2 , 5) = 1 (ni 
and n 2 are considered equal). 

• else if | ni — n 2 | < 5 and the intervals n\ and n 2 are not both fathers of a leaf 
then: 

nc(ni) 

C{n u n 2 ,5)= fl (1 + (7(0%! , j),ch(n 2 ,j),S)) (5.27) 

i=i 

Note that this kernel suffers of the same issues as the Collins and Duffy tree kernel, 
for this reason it could be useful to consider the variant proposed in 5.21, 5.22, 5. 23. Note 
also that here the node n\ and n 2 

5.3.2 MLA Convolution Kernel 

This kernel is defined starting from the interval representation of a signal trough the 
Equally Spaced MLA defined in Chapter 2. In particular given 2 signal x,y and let 
T(x) = {Ixi, 1x2, • • • , Ixk} and T(y) = {Iyi, Iy 2 , ■ ■ ■ , lyx} their intervals representa- 
tion with K threshold operations. 

Definition MLA Convolution Kernel 

Let I a generic set of intervals from some interval representation of a signal of length 

L and let define Bj a signal of length L with: 

f 1 if 3 an interval la, b] G I such that j G fa, b] 
Bi(j) = { (5-28) 

I otherwise 

with 1 < j < L. In this way to a generic interval representation it is possible to associate 
a set of binary string. 
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Finally the kernel is defined as: 

K— hnp+1 

rip 



K— hnp+1 

S(x,y)= Yl — 



k=l+hnp 



k+hnp—1 \ / k+hnp— 1 

Yl Bi *i Yl Bi vj 

ij=k— hnp+1 J \j=k— hnp+1 



(5.29) 



where < 7 < 1 and np = \j * K\ and hnp — — 



2 ■ 



This kernel function can be seen as a local correlation between correspondent in- 
ternal portions of the signals and in which the size of the portion is controlled by the 
parameter 7. 

5.4 Support Vector Machines 

Support Vector Machines (SVM) are learning systems that use an hypothesis space 
of linear functions in an high dimensional space, trained with a learning algorithm for 
optimization motivated from statistical learning theory [19]. SVM are binary classifiers; 
in particular the discriminative function of the SVM represent a linear decision boundary 
also called margin. More formally, a SVM constructs an hyperplane in a high (eventually 
infinite) dimensional space, using the implicit projection of the kernel functions, in order 
to obtain a good separation between positive and negative points. In particular SVM 
consider the hyperplane that has the largest distance to the nearest training data points 
of any class since in general the larger the margin the lower the generalization error of 
the classifier. In figure 5.5 it is possible to see the concept of margin and the hyperplane 
(a straight line in 2 dimensions). The interested reader can found a good survey of SVM 
classifiers in [75]. 

5.5 Experimental Setup 

In this section three experiments that use the MLA Tree kernel will be presented, in 
particular the first two involve a classification, while the third is related to clustering. 

5.5.1 Synthetic data: discrimination power of MLA Tree Kernel on 
basic functions 

To validate MLA Tree Kernel, three basic signals that can be characterized in term of 
shape in time domain, has been considered (see figure 5.7): 

• sinusoid signal 

• rectangular pulse signal 

• sawtooth signal 
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Kernel Function 


Correctly Classified 


Accuracy 


MLA Tree 


150/150 


100% 


Linear 


143/150 


95% 


Polinomial(2) 


131/150 


87% 


RBF 


130/150 


87% 


Sigmoid 


141/150 


94% 



Table 5.1: Classification accuracy on basic functions dataset. 

As training set S, N signals have been generated with an increasing linear SNR 
noise value ranging from 0.1 to 1, for each of the three categories. In this way, one 
dispose of a training set with 3 x N elements and with 3 classes. Analogously a Test 
Set T disjointed from S was taken into account, with the same cardinality i.e. 3xJV. 
To validate the performances, a Support Vector Machine with different kernel functions 
has been considered: linear, polynomial, RBF, sigmoid and MLA Tree. The results 
obtained with A = 50 and with different kernels are shown in table 5.1. As it it is 
possible to see all the kernels obtain very good performances although in the case of 
very noisy signal the MLA Tree Kernel can still recover the shape information leading 
to a slightly better result. This makes the MLA tree kernel more robust to noise than 
the other kernels. 

5.5.2 Synthetic data: MLA Tree Kernel on waveform dataset 

In this experiment the dataset from [8] was considered. It contains 5000 instances 
divided in 3 classes of waves of 21 attributes, all of which include gaussian noise with 
mean and variance 1. In particular, each class is generated from a combination of 2 of 
3 "base" waves. The best accuracy that has been obtained processing this dataset has 
been reached by the Optimal Bayes classifier, with a value of 86%. Here the dataset was 
split in two balanced parts (training and test sets) of 1500 elements equally distributed 
into the tree classes for evaluating the performances of MLA Tree kernel with a SVM 
classifier. In particular as in the previous experiment, linear, polynomial, RBF, sigmoid 
and MLA Tree kernels functions have been used. In the table 5.2 results are shown. As 
it it is possible to see all the kernels obtain very good performances. 



5.5.3 Assessment of induced distance of MLA Convolution Kernel for 
clustering of seismic signal 

The dataset taken in exam for this experiment consists of n undersea explosion of an 
array of bombs at different distanced from a ship. This dataset was builded in order to 
have a well characterized set of signals to use as a benchmark for problems involving 
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Kernel Function 


Correctly Classified 


Accuracy 


MLA Tree 


1364/1500 


91% 


Linear 


1286/1500 


86% 


Polinomial(2) 


1187/1500 


80% 


RBF 


1286/1500 


86% 


Sigmoid 


795/1500 


53% 



Table 5.2: Classification accuracy on waveforms dataset. 

geological signals. In particular, the ship record for each explosion at time U a signal 
Si that express the variation on pressure level. The explosions take place at regular 
intervals of 300 seconds and each signal is sampled at lOOhz. A particularity of this 
dataset, as it is possible to see in figure 5.8, is that close temporal explosions occurs at 
similar distances from the ship. This means that given a signal Sj, with high probability 
the most similar signal in term of shape is the signal Sj+d with d close to 1 or —1 i.e. a 
signal recorded in proximity of instant U. This property allows to test in a natural way 
the performances of a similarity or dissimilarity function comparing the "order" that 
it induces on the set of signals. In particular let si, . . . , s n the set of signals recorded 
at starting time t±,...,t n respectively, and the natural order of the signals can be 
represented by the permutation P = (1,2, ... , n). Given a generic distance d, let D 
the n x n distance matrix containing all the pairwise distances between the signals i.e. 
Dij = d(si,Sj) with 1 < i,j < n. A measure of goodness of distance, can be defined 
by the distance optimality function so defined: 

Definition Distance Optimality 

Given a distance d and a dataset S of size N, and let D the pairwise distance matrix 

with D^j = d(si, Sj), Si, Sj £ S and 1 < i, j < n, the distance optimality of d is defined 

as: 



do 



£ 



J-l| 



» 



with j 



argmin D i)k 

l<k<n,kj^i 



(5.30) 



What is expected, in the case of a good distance measure, is a do ~ 0. It was 
assessed the performances of the distance induced by MLA Tree Kernel (using the 
equation 5.14 on its Gram's matrix) and compared its results with two common 
distances i.e. Euclidean distance and Spearman correlation distance by the distance 
optimality function. Note that the used Spearman correlation distance is defined as 
1 — r where r is the Spearman correlation index defined in 2 by equation 2.8. The 
results of this analysis are shown on table 5.3. As it is possible to see, the induced 
distance from MLA Convolution Kernel can exploit better the natural similarity 
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Distance 


Distance Optimality 


MLA Convolution 

Euclidean 

Pearson Correlation 


0.2369 
0.3889 
0.2813 



Table 5.3: Distance optimality on geological signals 



between signals than the other classic measures. 



This chapter has shown how the data extracted by MLA can be optimally organized 
in a tree of intervals, encoding the shape properties of a signal, using a particular 
aggregation rule. It was shown also an example of kernel trees properly adapted to be 
used with this tree representation induced by MLA. In addition another convolution 
kernel and based on local correlations was introduced. The first results are encouraging 
although it is necessary to do a more systematic study on the class of kernel functions 
that can be induced by the proposed aggregation rule on the interval representation 
and also on their properties and extensions. The major suggestion of the study carried 
out in this chapter is the connection between the class of algorithms on trees and graph 
and the class of digital signal processing technique. In fact, the MLA transformation 
can be useful to search for relation between operation on trees and graph and signal 
manipulation in time or frequency domain. 
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(a) 





(d) 



Figure 5.4: General Schema of Kernel Methods 
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Figure 5.5: SVM margin and the separation hyperplane 
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Figure 5.6: Basic function 
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Figure 5.7: Basic function plus noise 




Figure 5.8: Schema of the experiment 



Chapter 6 

Conclusions and Future Directions 



This thesis has introduced a new methodology called Multi Layer Analysis (MLA), 
and its use on several contexts such as Pattern Discovery, Classification, Clustering 
and also Test of Randomness. In chapter 3, 4, and 5 several application domains 
related to these problems have been faced with the MLA approach. In some 
sense, the use of MLA can be considered as a general boosting step to improve 
classic algorithms in the fields of classification or clustering. The main idea be- 
hind MLA is the transformation from the space of one-dimensional signals into a new 
space called the space of intervals in which a more detailed analysis could be performed. 

In particular, in chapter 3 it has been shown that, by using particular aggregation 
rules on such space, it is possible to characterize different signal shapes; this allows to 
approach some key problems in biology i.e. the nucleosome spacing problem. 

Moreover, in chapter 5 it has been proposed another aggregation rule that is capable 
to represent a one-dimensional signal in terms of a tree of intervals, and thus permits 
to express or characterize any kind of shape. This point has strong implications since it 
establishes a connection between the class of algorithms that process one-dimensional 
signal such as digital signal processing techniques, and algorithms on trees and 
graphs. This result is really important because it makes possible the application of 
particular transformations on a one-dimensional signal, modifying its tree representa- 
tion and viceversa. In this sense further investigation in this direction will be performed. 

The final consideration is that MLA can be fruitfully applied on problems that 
involve the processing of one-dimensional signals, such as Geology, Biomedicine, 
Biology and other disciplines. In some cases MLA on such problems have comparable 
or sometimes superior performances to other methodologies currently applied for 
the same purposes. Further investigation on MLA properties and its extension to 
multidimensional data will be investigated. 
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